Connections¶
acid is module-level and singleton-by-default. You don't have to
create or manage a connection object: call acid.open(...) /
acid.sql.query(...) and the first call lazily builds one process-wide
default connection for you, reusing it across every later call.
import acid
df = (acid.open("gaia_dr3")
.where("phot_g_mean_mag < 16")
.select("source_id, ra, dec")
.head(100)
.to_polars())
That's the whole loop. The rest of this page covers the two things
sitting behind it: how the default connection is configured and torn
down, and when you'd reach past it for the explicit
acid.Connection(...) escape hatch.
The module-level API¶
| Call | Does |
|---|---|
| acid.open(name_or_path, *, alias=None, columns=None, ra=None, dec=None) | Open a catalog on the default connection → a lazy Catalog. |
| acid.sql.query(query, *, output=None, progress=None) | Run SQL on the default connection → a Result. |
| acid.init(source=None, *, workers="auto", ...) | Optional. Pin the default connection's config before first use. |
| acid.configure(*, progress=...) | Set process-wide display defaults (e.g. progress bars). |
| acid.shutdown() | Tear the default connection down (idempotent). |
| acid.is_initialized() | Whether a default connection currently exists. |
| acid.register_catalog, acid.register_file, acid.register_moc, acid.list_catalogs, acid.in_cone, acid.status | The same operations the Connection exposes, delegated to the default connection. |
| acid.sql.query, acid.sql.validate, acid.sql.explain | The SQL escape hatch: SQL-string entry points grouped under the acid.sql submodule. |
The mental model: there is one connection per process, created on demand and shared. You only think about a connection object at all when you need more than one of them.
Lazy first use vs. acid.init¶
acid.init(...) is optional. If you never call it, the first
acid.open(...) / acid.sql.query(...) builds the default connection with
default settings (source resolved from the ACID_PATH env var, the
config layer, or ~/datasets if it exists).
Call acid.init(...) up front only when you want to pin the
configuration — a specific source, an explicit worker count, a memory
budget:
import acid
acid.init("/data/hats", workers=16) # pin config once
gaia = acid.open("gaia_dr3") # uses the pinned connection
df = gaia.where("phot_g_mean_mag < 16").head(100).to_polars()
acid.init(...) is singleton-by-default, the same getOrCreate
shape Ray and Spark use:
- Not yet initialized → build and stash the default connection.
- Already initialized with the same resolved config → no-op.
- Already initialized with a different config → ConfigError. Call acid.shutdown() first to rebuild, or pass reuse_existing=True to keep the current connection and ignore the new arguments (you'll get a warning that the args were dropped).
acid.init("/data/hats", workers=16)
acid.init("/data/hats", workers=16) # same config -> no-op
acid.init("/data/hats", workers=32) # different -> ConfigError
acid.shutdown()
acid.init("/data/hats", workers=32) # now fine
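The three rules above can be sketched in plain Python. This is a minimal illustration of the getOrCreate shape, not acid's implementation: Config, _default, and this init / shutdown pair are hypothetical stand-ins, and only two of the real arguments are modeled.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


class ConfigError(RuntimeError):
    """Stand-in for the error raised on a conflicting re-init."""


@dataclass(frozen=True)
class Config:
    # Hypothetical: only two of the real init arguments are modeled.
    source: str
    workers: int


_default: Optional[Config] = None  # hypothetical process-wide slot


def init(source: str, workers: int, reuse_existing: bool = False) -> Config:
    """Create the default on first call; afterwards apply the singleton rules."""
    global _default
    requested = Config(source, workers)
    if _default is None:
        _default = requested  # not yet initialized: build and stash
    elif _default != requested:
        if not reuse_existing:
            raise ConfigError("already initialized with a different config")
        warnings.warn("reuse_existing=True: new arguments were dropped")
    return _default  # same config falls through as a no-op


def shutdown() -> None:
    """Drop the default so the next init() can rebuild. Safe to repeat."""
    global _default
    _default = None
```

Comparing frozen dataclasses by value is what makes "same resolved config" a cheap equality check in this sketch.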
acid.init arguments¶
acid.init(
source=None, # dir / YAML / list / dict / Registry / None
*,
workers="auto", # int or "auto"
threads=None, # per-worker Polars thread budget
mem_per_worker_gb=None, # RAM budget per worker for "auto" sizing
inmem_row_limit=None, # default 50_000_000 — phase-1 spill threshold
tmpdir=None, # base scratch directory
workers_jemalloc_conf=None, # jemalloc tuning for workers
config=None, # path to a specific acid.conf
reuse_existing=False, # keep an existing differently-configured conn
)
source is one of:
- A directory path → used as a root for basename catalog lookup. Each subdirectory with a HATS properties file is openable by name via acid.open(name).
- A path ending in .yaml / .yml → loaded as a registry config file (see Catalogs and the registry).
- A list of paths → multiple roots, searched in order.
- A dict → inline registry config, same shape as YAML.
- None → resolve from ACID_PATH / the config layer / ~/datasets.
Catalogs can also be added later via acid.register_catalog(...) or via absolute paths to acid.open(...).
workers="auto" resolves at construction time via os.sched_getaffinity
and the cgroup CPU quota when available, falling back to os.cpu_count(). Pin
an explicit workers=N when you want reproducible parallelism
(benchmarks, CI). See Performance & parallelism for
the full worker/thread story.
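For intuition, here is one way an "auto" worker count could be derived from those three inputs. It's a sketch under stated assumptions, not acid's actual resolver: parse_cpu_max and resolve_auto_workers are hypothetical names, the quota is read from the cgroup v2 cpu.max file, and rounding a fractional quota up is a guess.

```python
import math
import os
from pathlib import Path
from typing import Optional


def parse_cpu_max(text: str) -> Optional[int]:
    """Parse cgroup v2 cpu.max content: "<quota> <period>" or "max <period>"."""
    try:
        quota, period = text.split()
        if quota == "max":
            return None  # no quota configured
        return max(1, math.ceil(int(quota) / int(period)))
    except (ValueError, ZeroDivisionError):
        return None  # malformed content: treat as "no quota"


def resolve_auto_workers(cpu_max_path: str = "/sys/fs/cgroup/cpu.max") -> int:
    """Affinity mask where the platform has one, else cpu_count, capped by quota."""
    if hasattr(os, "sched_getaffinity"):  # not available on every platform
        cpus = len(os.sched_getaffinity(0))
    else:
        cpus = os.cpu_count() or 1
    try:
        limit = parse_cpu_max(Path(cpu_max_path).read_text())
    except OSError:
        limit = None  # no readable cgroup v2 file
    return max(1, min(cpus, limit) if limit else cpus)
```

Because the result depends on the machine and container limits, it differs between a laptop and a CI runner, which is the reason to pin workers=N for benchmarks.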
progress is not an init argument
Progress-bar rendering is a display preference, not worker-pool
config, so it lives on acid.configure(...)
instead. Changing it never rebuilds the pool.
Display defaults¶
acid.configure(progress=False) # silence progress bars process-wide
acid.configure(progress="auto") # bars when stderr is a TTY / under IPython
acid.configure(progress=...) mutates the running default connection in
place (if one exists) and seeds the next acid.init(...). Each
materialization call (.execute, .head, .to_polars, acid.sql,
...) can still override per call with progress=....
Teardown¶
acid.shutdown() tears down the default connection's worker pool and
frees the engine. It's idempotent, and the next module-level call
lazy-inits a fresh default. You rarely need to call it explicitly:
acid registers an atexit hook that tears the default connection down
when the interpreter exits, so a script or notebook that just calls
acid.open(...) / acid.sql.query(...) cleans up on its own.
Call it explicitly when you want to rebuild with different config (see the singleton rules above) or to release the worker pool early in a long-lived process.
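The idempotent-teardown-plus-atexit pattern described here looks roughly like this in plain Python. It's an illustrative sketch, not acid's code: _state, ensure_default, and this shutdown are hypothetical, and a bare object() stands in for the worker pool.

```python
import atexit

_state = {"pool": None}  # hypothetical holder for the default connection


def ensure_default():
    """Lazily build the default on first use (a placeholder object here)."""
    if _state["pool"] is None:
        _state["pool"] = object()
    return _state["pool"]


def shutdown() -> bool:
    """Tear down if something is live, else do nothing. Returns whether it acted."""
    if _state["pool"] is None:
        return False
    _state["pool"] = None
    return True


# Runs at interpreter exit, so scripts clean up without an explicit call.
atexit.register(shutdown)
```

Calling shutdown() twice is harmless in this model, and the next ensure_default() builds a fresh placeholder, mirroring the lazy re-init described above.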
Lazy by design¶
Two things the default connection does not do, even after
acid.init(...):
- It does not start workers. The pool spins up on the first query (acid.sql.query(...) or any Catalog.execute() / .to_* / .head / .save call). acid.init(...) at the top of a notebook is therefore cheap.
- It does not walk the source directory. If you want a list of catalogs, ask: acid.list_catalogs(). The walk is opt-in because populating a registry by walking 30 catalog directories just so you can use two of them is wasteful (and slow on remote roots).
What it does do eagerly is validate the source path: a missing or
unreadable local directory raises immediately, not on first use. URL
sources are not pre-validated (a network round-trip on init is worse
than a clean error on first acid.open).
Catalogs are lazy too¶
acid.open("gaia_dr3") reads properties, _common_metadata, and
partition_info.csv — small files, microseconds-to-milliseconds — and
hands you back a frozen Catalog dataclass. From there, composition is
pure:
gaia = acid.open("gaia_dr3") # one metadata read
bright = gaia.where("phot_g_mean_mag < 16") # no I/O
local = bright.select("source_id, ra, dec") # no I/O
narrow = local.limit(1000) # no I/O
I/O happens when you ask for results — .head, .execute,
.to_polars, .to_astropy, .save. Errors surface there, in exactly
the call you wrote.
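Why composition costs no I/O is easiest to see in a toy model. The sketch below is not acid's Catalog: LazyQuery and its fields are hypothetical, and it only shows how a frozen dataclass can make each verb return a new description of the query while leaving the original untouched.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class LazyQuery:
    """Hypothetical stand-in: a description of work, not the work itself."""
    name: str
    filters: Tuple[str, ...] = ()
    columns: Optional[str] = None
    row_limit: Optional[int] = None

    def where(self, predicate: str) -> "LazyQuery":
        return replace(self, filters=self.filters + (predicate,))

    def select(self, columns: str) -> "LazyQuery":
        return replace(self, columns=columns)

    def limit(self, n: int) -> "LazyQuery":
        return replace(self, row_limit=n)


# Each step builds a new immutable value; nothing is read or executed.
base = LazyQuery("gaia_dr3")
narrow = base.where("phot_g_mean_mag < 16").select("source_id, ra, dec").limit(1000)
```

Since base is never mutated, it can be reused as the starting point for other chains, which is the property the fluent examples above rely on.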
cat.describe() returns a dict with row count, partition count,
column schema, RA/Dec column names, footprint area, and similar — all
from cached metadata, no parquet reads.
cat.explain() returns a human-readable summary of the plan acid
would compile from the fluent chain, so you can sanity-check what you're
about to run.
SQL vs the fluent API¶
The SQL escape hatch is acid.sql.query(...). It runs the full pipeline:
phase-1 partition-by-partition, then a phase-2 reduce when the query
needs it (decomposable aggregates, HAVING, top-K ORDER BY ... LIMIT).
Use it for anything the fluent verbs don't cover. (Window functions,
DISTINCT, COUNT(DISTINCT), bare GROUP BY, and unbounded ORDER BY are
rejected.)
r = acid.sql.query("""
SELECT g.source_id, COUNT(*) AS n
FROM gaia AS g
JOIN twomass AS t ON XMATCH(r => 1.0, mode => 'all')
GROUP BY g.source_id
HAVING COUNT(*) >= 2
""")
print(r)
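To see why a query like this needs the phase-2 reduce, consider COUNT(*) with HAVING over partitioned data. The sketch below is a plain-Python illustration of the two-phase idea, not acid's engine: phase1_count and phase2_reduce are hypothetical names, and rows are modeled as dicts.

```python
from collections import Counter
from typing import Dict, Iterable


def phase1_count(partition_rows: Iterable[dict]) -> Counter:
    """Phase 1: partial COUNT(*) per source_id within a single partition."""
    return Counter(row["source_id"] for row in partition_rows)


def phase2_reduce(partials: Iterable[Counter], min_count: int) -> Dict[int, int]:
    """Phase 2: merge the partial counts, then apply HAVING COUNT(*) >= min_count."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # counts add up across partitions
    return {key: n for key, n in total.items() if n >= min_count}


# source_id 1 appears once in each partition, so only the merged count reaches 2.
partitions = [
    [{"source_id": 1}, {"source_id": 2}],
    [{"source_id": 1}, {"source_id": 3}],
]
result = phase2_reduce([phase1_count(p) for p in partitions], min_count=2)
```

Filtering inside phase 1 would drop source_id 1 from both partitions, which is why HAVING has to wait for the merged counts. COUNT is decomposable because partial counts simply sum.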
The fluent verbs (acid.open(...).where(...).crossmatch(...), etc.)
compile to the same pipeline acid.sql.query(...) does. Use them when the
shape fits: filters, projections, crossmatches, ordinary equi-joins,
footprint masks. Drop into SQL when it doesn't: aggregates, top-K
ordering, HAVING. Window functions are not a reason to switch, since
acid.sql.query(...) rejects them.
Cone scope: debug small, run big¶
acid.in_cone(center, *, radius) is a context manager that scopes a
spatial cone to every query whose execution happens inside the
with block. Both fluent and SQL see the cone:
import acid
import astropy.units as u
gaia = acid.open("gaia_dr3")
q = gaia.where("phot_g_mean_mag < 16")
# Iterate while debugging in a 1-degree patch:
with acid.in_cone((180, 0), radius=1 * u.deg):
    small = q.to_polars() # scoped to the cone
# Same query object, full sky — just outside the block:
big = q.to_polars() # no cone
The cone is execution-time: it's read when the query runs, not
captured when the Catalog was built. So you can build a query once and
run it scoped (inside the block) or full-sky (outside) using the same
Catalog handle. There are no stale handles to worry about.
radius must be an astropy.units.Quantity. Bare floats are rejected
so units never get guessed wrong. center is either a SkyCoord (any
frame; converted to ICRS) or an (ra_deg, dec_deg) tuple (ICRS
assumed).
Two things to know:
- No nesting. Entering a second in_cone block while one is active raises ValidationError. The true intersection of two non-concentric cones is a lens, not a cone; we refuse rather than silently approximate. Compose the two regions into a single cone (or use a MOC via Catalog.in_region) before entering.
- One cone at a time, applied uniformly. A cone applies to every coord-bearing alias in a query. Scoping it on the connection makes that visible from where the verb is called, and lets the same with block flip the whole workflow between a debug patch and a full-sky run.
See Sky regions & footprints for the cone vs. MOC comparison.
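One way to get this execution-time behavior in Python is a context variable that the run step reads. The following is a sketch of that idea only, and nothing here confirms that acid implements in_cone this way. The names _active_cone and run are hypothetical, the radius is simplified to a float in degrees (the real API requires an astropy Quantity), and ValueError stands in for ValidationError.

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Hypothetical slot holding the active (center, radius_deg) scope, if any.
_active_cone = ContextVar("active_cone", default=None)


@contextmanager
def in_cone(center, radius_deg):
    """Scope a cone to everything executed inside the with block."""
    if _active_cone.get() is not None:
        raise ValueError("nested in_cone blocks are not allowed")
    token = _active_cone.set((center, radius_deg))
    try:
        yield
    finally:
        _active_cone.reset(token)  # restore "no cone" on exit, even on error


def run(query):
    """Read the scope at execution time, not at query-build time."""
    return (query, _active_cone.get())
```

In this model the query value carries no cone at all, so the same object runs scoped inside the block and full-sky outside it.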
acid.open and naming¶
acid.open(name_or_path, *, alias=None, columns=None) resolves in this
order:
- Absolute path or URL (/, ~, http://, ssh://, ...) → use directly. Verify it points at a HATS catalog.
- Named entry in the YAML config (if acid.init(...) was given a YAML) → use the configured path.
- Basename match against the connection's roots, searched in order.
- Else → RegistryError.
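The lookup order can be written down as a short function. This is a hedged sketch, not acid's resolver: resolve, its named / roots parameters, and the local RegistryError class are hypothetical. The HATS check is reduced to looking for a properties file under a local root, and step 1 skips verification entirely.

```python
from pathlib import Path
from typing import Mapping, Optional, Sequence


class RegistryError(LookupError):
    """Stand-in for the error raised when nothing matches."""


def resolve(name: str,
            named: Optional[Mapping[str, str]] = None,
            roots: Sequence[str] = ()) -> str:
    # 1. Absolute path, home-relative path, or URL: use as given.
    if name.startswith(("/", "~")) or "://" in name:
        return name
    # 2. Named entry from a YAML-style registry.
    if named and name in named:
        return named[name]
    # 3. Basename match against each root, in the order given.
    for root in roots:
        candidate = Path(root) / name
        if (candidate / "properties").is_file():
            return str(candidate)
    # 4. Nothing matched.
    raise RegistryError(f"unknown catalog: {name!r}")
```

Because roots are searched in order, an earlier root wins when two roots contain a catalog with the same basename.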
The catalog's alias — what you reference in acid.sql.query(...) (SELECT
gaia.source_id ...), and what names a join's collision suffix
(_<alias>), the in_region / MOC scope, and self-joins — comes from
the basename (lowercase, non-alphanumeric → _, truncated to 16
chars), or from the explicit alias=... you pass. (Fluent .where(...)
/ .select(...) strings use flat column names, not the alias.) Pass
alias=... whenever the default is ugly or you want to open the same
catalog twice with different filters:
g1 = acid.open("gaia_dr3", alias="bright").where("phot_g_mean_mag < 16")
g2 = acid.open("gaia_dr3", alias="faint").where("phot_g_mean_mag >= 16")
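The default-alias rule (lowercase, non-alphanumeric to _, truncate to 16 characters) is simple enough to sketch. default_alias is a hypothetical helper, not part of acid, and it assumes each offending character maps to exactly one underscore.

```python
import re
from pathlib import PurePosixPath


def default_alias(name_or_path: str) -> str:
    """Derive an alias from the basename, following the stated rule."""
    base = PurePosixPath(name_or_path.rstrip("/")).name
    # Lowercase first, then replace anything outside a-z / 0-9 with "_".
    return re.sub(r"[^a-z0-9]", "_", base.lower())[:16]
```

Under these assumptions, long or punctuated directory names produce aliases like a_very_long_cata, which is the kind of default worth overriding with alias=....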
Materializing intermediates¶
Catalog.save(path, *, name=None, overwrite=False) writes a query
result as a HATS catalog and registers it on the connection under
name (default: the basename of path):
import astropy.units as u
nearby = (acid.open("gaia_dr3")
    .crossmatch(acid.open("twomass"), radius=1 * u.arcsec)
    .save("/data/out/nearby", name="nearby"))
# `nearby` is a normal Catalog handle; "nearby" is also resolvable by name.
acid.sql.query("SELECT COUNT(*) FROM nearby").show()
This is the canonical EDA pattern: run a heavy join once, save it, then
iterate cheaply against the cached on-disk catalog. The output is a
proper HATS tree — lsdb.open_catalog(...) and hats.read_hats(...)
read it without modification.
When you only want a single parquet / CSV / FITS file, use
.export(...) instead (format from the extension; it gathers the
result in memory, so it's for modest outputs — see
save vs export):
(acid.open("gaia_dr3")
.where("phot_g_mean_mag < 16")
.select("source_id, ra, dec")
.export("bright.parquet"))
Introspection¶
A few small calls you'll reach for at the REPL:
- acid.status() → a ConnectionStatus dataclass: state ("idle", "active", "closed"), workers, engine, queries_executed, catalogs_open.
- acid.list_catalogs() → crawls the connection's roots (local / ssh / http) and returns CatalogInfo rows (name, margins_arcsec, root, shadowed), the Python equivalent of acid list. Opt-in discovery.
- acid.sql.validate(query) → returns the analyzed plan without executing.
- acid.sql.explain(query) → returns a human-readable summary of the plan acid would run.
Explicit isolation — acid.Connection¶
The module-level API shares one connection per process. When you need
more than one connection, or a connection with config that differs
from the process default, construct acid.Connection(...) directly.
It's the explicit-isolation escape hatch. Reach for it when you need:
- Two simultaneous connections, e.g. two different source roots, or the same root at two different worker counts, live at once.
- Two configs in one process, without the singleton's "same config or ConfigError" rule getting in the way.
- Library / test isolation: a library that uses acid internally shouldn't fight the host application over the process-wide default; a test that wants a clean pool each time shouldn't leak into the next.
Use it as a context manager so the worker pool tears down deterministically:
import acid
with acid.Connection("/data/hats", workers=8) as db:
    gaia = db.open("gaia_dr3")
    df = (gaia.where("phot_g_mean_mag < 16")
          .select("source_id, ra, dec")
          .head(100)
          .to_polars())
acid.Connection(...) takes the same configuration arguments as
acid.init(...) (plus progress=), but it is fully isolated: it
does not touch, read, or replace the process-wide default. Its
db.open / db.sql / db.in_cone methods mirror the module-level
functions exactly — acid.open is db.open on the hidden default
connection.
acid.Connection(
source, # dir / YAML / list / dict / None
*,
workers="auto",
threads=None,
mem_per_worker_gb=None,
inmem_row_limit=50_000_000,
tmpdir=None,
progress="auto",
workers_jemalloc_conf=None,
)
Lifecycle¶
A Connection is not thread-safe. Don't share one across threads; open
one per thread instead.
A Connection can't be pickled. __getstate__ is not defined;
attempting to send one across processes raises.
db.close() (or the with exit) tears down the worker pool gracefully
and frees the engine. After close, every method on the connection raises
ConnectionClosedError, and every Catalog previously returned by
db.open does the same on its next materialization. There's no
resurrection.
In notebooks, prefer the module-level API (which the atexit hook
cleans up for you). If you want an explicit isolated connection that
lives across cells, hold it open and call db.close() when you're done.
Reuse one connection across many queries
The worker pool is connection-scoped, and pool spin-up dominates
short-query time. The module-level default already does this for you
— it's one shared connection across the whole process. With an
explicit acid.Connection, keep the single instance open across a
session or a batch loop rather than reconstructing it per query. See
Performance & parallelism — Laptop → cluster scaling
for the timing details.