Connections

acid is module-level and singleton-by-default. You don't have to create or manage a connection object: call acid.open(...) / acid.sql.query(...) and the first call lazily builds one process-wide default connection for you, reusing it across every later call.

import acid

df = (acid.open("gaia_dr3")
          .where("phot_g_mean_mag < 16")
          .select("source_id, ra, dec")
          .head(100)
          .to_polars())

That's the whole loop. The rest of this page covers the two things sitting behind it: how the default connection is configured and torn down, and when you'd reach past it for the explicit acid.Connection(...) escape hatch.

The module-level API

Call | Does
--- | ---
acid.open(name_or_path, *, alias=None, columns=None, ra=None, dec=None) | Open a catalog on the default connection → a lazy Catalog.
acid.sql.query(query, *, output=None, progress=None) | Run SQL on the default connection → a Result.
acid.init(source=None, *, workers="auto", ...) | Optional. Pin the default connection's config before first use.
acid.configure(*, progress=...) | Set process-wide display defaults (e.g. progress bars).
acid.shutdown() | Tear the default connection down (idempotent).
acid.is_initialized() | Whether a default connection currently exists.
acid.register_catalog, acid.register_file, acid.register_moc, acid.list_catalogs, acid.in_cone, acid.status | The same operations the Connection exposes, delegated to the default connection.
acid.sql.query, acid.sql.validate, acid.sql.explain | The SQL escape hatch: SQL-string entry points grouped under the acid.sql submodule.

The mental model: there is one connection per process, created on demand and shared. You only think about a connection object at all when you need more than one of them.

Lazy first use vs. acid.init

acid.init(...) is optional. If you never call it, the first acid.open(...) / acid.sql.query(...) builds the default connection with default settings (source resolved from the ACID_PATH env var, the config layer, or ~/datasets if it exists).

Call acid.init(...) up front only when you want to pin the configuration — a specific source, an explicit worker count, a memory budget:

import acid

acid.init("/data/hats", workers=16)        # pin config once

gaia = acid.open("gaia_dr3")               # uses the pinned connection
df   = gaia.where("phot_g_mean_mag < 16").head(100).to_polars()

acid.init(...) is singleton-by-default, the same getOrCreate shape Ray and Spark use:

  • Not yet initialized → build and stash the default connection.
  • Already initialized with the same resolved config → no-op.
  • Already initialized with a different config → ConfigError. Call acid.shutdown() first to rebuild, or pass reuse_existing=True to keep the current connection and ignore the new arguments (you'll get a warning that the arguments were dropped).

acid.init("/data/hats", workers=16)
acid.init("/data/hats", workers=16)   # same config -> no-op
acid.init("/data/hats", workers=32)   # different -> ConfigError

acid.shutdown()
acid.init("/data/hats", workers=32)   # now fine
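
If it helps to see the three rules as code, here is a minimal model in plain Python. It is a sketch of the get-or-create shape only, not acid's implementation: init, shutdown, ConfigError, and the _default slot are stand-ins defined locally, and the "resolved config" is reduced to a plain dict.

```python
import warnings


class ConfigError(Exception):
    """Stand-in for acid's ConfigError (hypothetical local definition)."""


_default = None  # the process-wide slot; None means "not initialized"


def init(source=None, *, workers="auto", reuse_existing=False):
    """Get-or-create: build once, no-op on same config, raise on conflict."""
    global _default
    requested = {"source": source, "workers": workers}
    if _default is None:
        _default = requested            # not yet initialized -> build and stash
    elif _default == requested:
        pass                            # same resolved config -> no-op
    elif reuse_existing:
        # keep the current connection and drop the new arguments, loudly
        warnings.warn("init arguments ignored; keeping existing connection")
    else:
        raise ConfigError("already initialized with a different config")
    return _default


def shutdown():
    """Idempotent teardown: safe to call with or without a live default."""
    global _default
    _default = None
```

The comparison happens on the resolved config, which is why a repeated identical call is harmless while a conflicting one fails loudly instead of silently winning or losing.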

acid.init arguments

acid.init(
    source=None,                     # dir / YAML / list / dict / Registry / None
    *,
    workers="auto",                  # int or "auto"
    threads=None,                    # per-worker Polars thread budget
    mem_per_worker_gb=None,          # RAM budget per worker for "auto" sizing
    inmem_row_limit=None,            # default 50_000_000 — phase-1 spill threshold
    tmpdir=None,                     # base scratch directory
    workers_jemalloc_conf=None,      # jemalloc tuning for workers
    config=None,                     # path to a specific acid.conf
    reuse_existing=False,            # keep an existing differently-configured conn
)

source is one of:

  • A directory path → used as a root for basename catalog lookup. Each subdirectory with a HATS properties file is openable by name via acid.open(name).
  • A path ending in .yaml / .yml → loaded as a registry config file (see Catalogs and the registry).
  • A list of paths → multiple roots, searched in order.
  • A dict → inline registry config, same shape as YAML.
  • None → resolve from ACID_PATH / the config layer / ~/datasets. Catalogs can also be added later via acid.register_catalog(...) or via absolute paths to acid.open(...).

workers="auto" resolves at construct time via os.sched_getaffinity + cgroup CPU quota when available, falling back to os.cpu_count(). Pin an explicit workers=N when you want reproducible parallelism (benchmarks, CI). See Performance & parallelism for the full worker/thread story.
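
For intuition, the "auto" resolution can be approximated with the standard library alone. This is a sketch under assumptions, not acid's code: auto_workers and cgroup_cpu_limit are hypothetical helpers, only the cgroup v2 cpu.max file is consulted, and fractional quotas are rounded down (acid may round differently).

```python
import os


def cgroup_cpu_limit(path="/sys/fs/cgroup/cpu.max"):
    """Return the cgroup v2 CPU quota in whole CPUs, or None if unlimited/unreadable."""
    try:
        with open(path) as fh:
            quota, period = fh.read().split()   # e.g. "200000 100000" or "max 100000"
        if quota == "max":
            return None                          # no quota configured
        return max(1, int(quota) // int(period))
    except (OSError, ValueError, ZeroDivisionError):
        return None                              # missing file or unexpected format


def auto_workers():
    """Pick a worker count from CPU affinity, capped by the cgroup quota if any."""
    if hasattr(os, "sched_getaffinity"):         # not available on every platform
        cpus = len(os.sched_getaffinity(0))
    else:
        cpus = os.cpu_count() or 1               # cpu_count() may return None
    limit = cgroup_cpu_limit()
    return max(1, min(cpus, limit) if limit else cpus)
```

Because the answer depends on the machine, the container quota, and the affinity mask, two runs of the same script can land on different worker counts, which is the reason to pin workers=N for benchmarks and CI.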

progress is not an init argument

Progress-bar rendering is a display preference, not worker-pool config, so it lives on acid.configure(...) instead. Changing it never rebuilds the pool.

Display defaults

acid.configure(progress=False)   # silence progress bars process-wide
acid.configure(progress="auto")  # bars when stderr is a TTY / under IPython

acid.configure(progress=...) mutates the running default connection in place (if one exists) and seeds the next acid.init(...). Each materialization call (.execute, .head, .to_polars, acid.sql, ...) can still override per call with progress=....
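
The precedence described above (per-call value first, then the process-wide default, then "auto") can be sketched as a small resolver. resolve_progress is a hypothetical helper written for illustration, and its "auto" branch only checks for a TTY; the IPython case is left out.

```python
import sys


def resolve_progress(per_call=None, process_default="auto"):
    """Per-call value wins; otherwise fall back to the process-wide default."""
    choice = process_default if per_call is None else per_call
    if choice == "auto":
        # Simplified: bars only when stderr is attached to a terminal.
        return sys.stderr.isatty()
    return bool(choice)
```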

Teardown

acid.shutdown() tears down the default connection's worker pool and frees the engine. It's idempotent, and the next module-level call lazy-inits a fresh default. You rarely need to call it explicitly: acid registers an atexit hook that tears the default connection down when the interpreter exits, so a script or notebook that just calls acid.open(...) / acid.sql.query(...) cleans up on its own.

Call it explicitly when you want to rebuild with different config (see the singleton rules above) or to release the worker pool early in a long-lived process.

Lazy by design

Two things the default connection does not do, even after acid.init(...):

  • It does not start workers. The pool spins up on the first query (acid.sql.query(...) or any Catalog.execute() / .to_* / .head / .save call). acid.init(...) at the top of a notebook is therefore cheap.
  • It does not walk the source directory. If you want a list of catalogs, ask: acid.list_catalogs(). The walk is opt-in because populating a registry by walking 30 catalog directories just so you can use two of them is wasteful (and slow on remote roots).

What it does do eagerly is validate the source path: a missing or unreadable local directory raises immediately, not on first use. URL sources are not pre-validated (a network round-trip on init is worse than a clean error on first acid.open).
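
A rough model of that split between eager local checks and deferred URL checks, using only the standard library. validate_source is a hypothetical helper, the set of URL schemes is an assumption, and the exception types are placeholders rather than the ones acid raises.

```python
import os
from urllib.parse import urlparse


def validate_source(source):
    """Fail fast on bad local directories; defer URL checks to first use."""
    if urlparse(source).scheme in {"http", "https", "ssh"}:
        return source                      # no network round-trip at init time
    path = os.path.expanduser(source)
    if not os.path.isdir(path):
        raise FileNotFoundError(f"source directory not found: {path}")
    if not os.access(path, os.R_OK | os.X_OK):
        raise PermissionError(f"source directory not readable: {path}")
    return path
```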

Catalogs are lazy too

acid.open("gaia_dr3") reads properties, _common_metadata, and partition_info.csv — small files, microseconds-to-milliseconds — and hands you back a frozen Catalog dataclass. From there, composition is pure:

gaia   = acid.open("gaia_dr3")                # one metadata read
bright = gaia.where("phot_g_mean_mag < 16")   # no I/O
local  = bright.select("source_id, ra, dec")  # no I/O
narrow = local.limit(1000)                    # no I/O

I/O happens when you ask for results — .head, .execute, .to_polars, .to_astropy, .save. Errors surface there, in exactly the call you wrote.
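
Why composition costs nothing is easiest to see with a toy frozen dataclass. LazyCatalog below is a stand-in written for this page, not acid's Catalog class: each verb returns a new immutable value that records intent, and nothing touches disk.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class LazyCatalog:
    """Toy stand-in for a lazy catalog handle (not acid's Catalog class)."""
    name: str
    filters: tuple = ()
    columns: tuple = ()
    row_limit: Optional[int] = None

    def where(self, predicate):
        # Record the predicate; no data is read here.
        return replace(self, filters=self.filters + (predicate,))

    def select(self, columns):
        names = tuple(c.strip() for c in columns.split(","))
        return replace(self, columns=names)

    def limit(self, n):
        return replace(self, row_limit=n)


gaia = LazyCatalog("gaia_dr3")
bright = gaia.where("phot_g_mean_mag < 16")
narrow = bright.select("source_id, ra, dec").limit(1000)
```

Since every step returns a fresh value, gaia and bright stay usable and unchanged after narrow is built, which is what makes branching a chain safe.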

cat.describe() returns a dict with row count, partition count, column schema, RA/Dec column names, footprint area, and similar — all from cached metadata, no parquet reads.

cat.explain() returns a human-readable summary of the plan acid would compile from the fluent chain, so you can sanity-check what you're about to run.

SQL vs the fluent API

The SQL escape hatch is acid.sql.query(...):

acid.sql.query(query, *, output=None, progress=None)   # -> Result

acid.sql.query(...) is the full pipeline: phase-1 partition-by-partition, phase-2 reduce when the query needs it (decomposable aggregates, HAVING, top-K ORDER BY ... LIMIT). Use it for anything the fluent verbs don't cover. (Window functions, DISTINCT, COUNT(DISTINCT), bare GROUP BY, and unbounded ORDER BY are rejected.)

r = acid.sql.query("""
    SELECT g.source_id, COUNT(*) AS n
    FROM   gaia AS g
    JOIN   twomass AS t ON XMATCH(r => 1.0, mode => 'all')
    GROUP BY g.source_id
    HAVING COUNT(*) >= 2
""")
print(r)
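
The two-phase shape is easier to picture with a toy version of a decomposable aggregate. The sketch below is illustrative only and says nothing about acid's internals: phase1 and phase2 are hypothetical names, partitions are plain lists of dicts, and it assumes a group's rows may be spread over several partitions.

```python
from collections import Counter


def phase1(partition):
    """Per-partition partial aggregate: COUNT(*) grouped by source_id."""
    return Counter(row["source_id"] for row in partition)


def phase2(partials, min_count):
    """Reduce: sum the partial counts, then apply HAVING COUNT(*) >= min_count."""
    total = Counter()
    for partial in partials:
        total.update(partial)          # COUNT decomposes into a sum of counts
    return {key: n for key, n in total.items() if n >= min_count}


partitions = [
    [{"source_id": 1}, {"source_id": 2}],
    [{"source_id": 1}, {"source_id": 3}],
]
result = phase2([phase1(p) for p in partitions], min_count=2)
print(result)  # {1: 2}
```

The HAVING filter has to wait for the reduce step: source_id 1 has only one row in each partition, so filtering per partition would have dropped it.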

The fluent verbs (acid.open(...).where(...).crossmatch(...), etc.) compile to the same pipeline acid.sql.query(...) does. Use them when the shape fits: filters, projections, crossmatches, ordinary equi-joins, footprint masks. Drop into SQL when it doesn't: decomposable aggregates, HAVING, top-K ORDER BY ... LIMIT. Window functions are not an option on either path; as noted above, the SQL pipeline rejects them.

Cone scope: debug small, run big

acid.in_cone(center, *, radius) is a context manager that scopes a spatial cone to every query whose execution happens inside the with block. Both fluent and SQL see the cone:

import acid
import astropy.units as u

gaia = acid.open("gaia_dr3")

q = gaia.where("phot_g_mean_mag < 16")

# Iterate while debugging in a 1-degree patch:
with acid.in_cone((180, 0), radius=1 * u.deg):
    small = q.to_polars()      # scoped to the cone

# Same query object, full sky — just outside the block:
big = q.to_polars()            # no cone

The cone is execution-time: it's read when the query runs, not captured when the Catalog was built. So you can build a query once and run it scoped (inside the block) or full-sky (outside) using the same Catalog handle. There are no stale handles to worry about.

radius must be an astropy.units.Quantity. Bare floats are rejected so units never get guessed wrong. center is either a SkyCoord (any frame; converted to ICRS) or an (ra_deg, dec_deg) tuple (ICRS assumed).

Two things to know:

  • No nesting. Entering a second in_cone block while one is active raises ValidationError. The true intersection of two non-concentric cones is a lens, not a cone; we refuse rather than silently approximate. Compose the two regions into a single cone (or use a MOC via Catalog.in_region) before entering.
  • One cone at a time, applied uniformly. A cone applies to every coord-bearing alias in a query. Scoping it on the connection makes that visible from where the verb is called, and lets the same with block flip the whole workflow between a debug patch and a full-sky run.
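
The execution-time behaviour and the no-nesting rule can be modelled with a plain context manager. This is a sketch of the scoping idea only: in_cone, run, and ValidationError are local stand-ins, the cone lives in a module global rather than on a connection, and the radius is a bare number of degrees (acid itself requires an astropy Quantity).

```python
from contextlib import contextmanager

_active_cone = None  # read at execution time, never stored on a query


class ValidationError(Exception):
    """Stand-in for acid's ValidationError (hypothetical local definition)."""


@contextmanager
def in_cone(center, radius_deg):
    global _active_cone
    if _active_cone is not None:
        # Refuse rather than approximate the intersection of two cones.
        raise ValidationError("in_cone blocks cannot be nested")
    _active_cone = (center, radius_deg)
    try:
        yield
    finally:
        _active_cone = None            # always cleared, even on error


def run(query):
    """'Execute' a query: the cone is looked up now, not when the query was built."""
    return {"query": query, "cone": _active_cone}
```

The query value never stores the cone, so the same object gives a scoped result inside the block and a full-sky result outside it.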

See Sky regions & footprints for the cone vs. MOC comparison.

acid.open and naming

acid.open(name_or_path, *, alias=None, columns=None, ra=None, dec=None) resolves name_or_path in this order:

  1. Absolute path or URL (/, ~, http://, ssh://, ...) → use directly. Verify it points at a HATS catalog.
  2. Named entry in the YAML config (if acid.init(...) was given a YAML) → use the configured path.
  3. Basename match against the connection's roots, searched in order.
  4. Else → RegistryError.
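
The four steps can be sketched as a lookup function. This is an approximation for illustration, not acid's resolver: resolve and RegistryError are defined locally, named stands in for the YAML entries, a catalog directory is recognized only by the presence of a file called properties, and the "verify it points at a HATS catalog" check from step 1 is left out.

```python
import os


class RegistryError(Exception):
    """Stand-in for acid's RegistryError (hypothetical local definition)."""


def resolve(name_or_path, named=None, roots=()):
    """Return the catalog location, trying the four rules in order."""
    # 1. Absolute paths, ~ paths, and URLs are used directly.
    if name_or_path.startswith(("/", "~")) or "://" in name_or_path:
        return os.path.expanduser(name_or_path)
    # 2. Named entry from the registry config.
    if named and name_or_path in named:
        return named[name_or_path]
    # 3. Basename match against each root, in order; first hit wins.
    for root in roots:
        candidate = os.path.join(root, name_or_path)
        if os.path.isfile(os.path.join(candidate, "properties")):
            return candidate
    # 4. Nothing matched.
    raise RegistryError(f"unknown catalog: {name_or_path!r}")
```

Because roots are searched in order, a catalog in an earlier root hides one with the same basename in a later root.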

The catalog's alias — what you reference in acid.sql.query(...) (SELECT gaia.source_id ...), and what names a join's collision suffix (_<alias>), the in_region / MOC scope, and self-joins — comes from the basename (lowercase, non-alphanumeric → _, truncated to 16 chars), or from the explicit alias=... you pass. (Fluent .where(...) / .select(...) strings use flat column names, not the alias.) Pass alias=... whenever the default is ugly or you want to open the same catalog twice with different filters:

g1 = acid.open("gaia_dr3", alias="bright").where("phot_g_mean_mag < 16")
g2 = acid.open("gaia_dr3", alias="faint").where("phot_g_mean_mag >= 16")
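
To predict the default alias for a given path, the basename rule can be written out as a one-liner. default_alias is a hypothetical helper, and the order of operations (lowercase, then replace, then truncate) is an assumption; acid's exact normalization may differ in edge cases.

```python
import re


def default_alias(path):
    """Basename -> lowercase, non-alphanumerics to '_', at most 16 chars."""
    base = path.rstrip("/").rsplit("/", 1)[-1]
    return re.sub(r"[^a-z0-9]", "_", base.lower())[:16]


print(default_alias("/data/hats/Gaia-DR3.v2/"))        # gaia_dr3_v2
print(default_alias("a_very_long_catalog_name_2024"))  # a_very_long_cata
```

Truncated names like the second one are a good reason to pass alias=... explicitly.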

Materializing intermediates

Catalog.save(path, *, name=None, overwrite=False) writes a query result as a HATS catalog and registers it on the connection under name (default: the basename of path):

import astropy.units as u

nearby = (acid.open("gaia_dr3")
            .crossmatch(acid.open("twomass"), radius=1 * u.arcsec)
            .save("/data/out/nearby", name="nearby"))

# `nearby` is a normal Catalog handle; "nearby" is also resolvable by name.
acid.sql.query("SELECT COUNT(*) FROM nearby").show()

This is the canonical EDA pattern: run a heavy join once, save it, then iterate cheaply against the cached on-disk catalog. The output is a proper HATS tree — lsdb.open_catalog(...) and hats.read_hats(...) read it without modification.

When you only want a single parquet / CSV / FITS file, use .export(...) instead (format from the extension; it gathers the result in memory, so it's for modest outputs — see save vs export):

(acid.open("gaia_dr3")
   .where("phot_g_mean_mag < 16")
   .select("source_id, ra, dec")
   .export("bright.parquet"))

Introspection

A few small calls you'll reach for at the REPL:

  • acid.status() → a ConnectionStatus dataclass: state ("idle", "active", "closed"), workers, engine, queries_executed, catalogs_open.
  • acid.list_catalogs() → crawls the connection's roots (local / ssh / http) and returns CatalogInfo rows (name, margins_arcsec, root, shadowed) — the Python equivalent of acid list. Opt-in discovery.
  • acid.sql.validate(query) → returns the analyzed plan without executing.
  • acid.sql.explain(query) → returns a human-readable summary of the plan acid would run.

Explicit isolation — acid.Connection

The module-level API shares one connection per process. When you need more than one connection, or a connection with config that differs from the process default, construct acid.Connection(...) directly. It's the explicit-isolation escape hatch. Reach for it when you need:

  • Two simultaneous connections — e.g. two different source roots, or the same root at two different worker counts, live at once.
  • Two configs in one process — without the singleton's "same config or ConfigError" rule getting in the way.
  • Library / test isolation — a library that uses acid internally shouldn't fight the host application over the process-wide default; a test that wants a clean pool each time shouldn't leak into the next.

Use it as a context manager so the worker pool tears down deterministically:

import acid

with acid.Connection("/data/hats", workers=8) as db:
    gaia = db.open("gaia_dr3")
    df = (gaia.where("phot_g_mean_mag < 16")
              .select("source_id, ra, dec")
              .head(100)
              .to_polars())

acid.Connection(...) takes the same configuration arguments as acid.init(...) (plus progress=), but it is fully isolated: it does not touch, read, or replace the process-wide default. Its db.open / db.sql / db.in_cone methods mirror the module-level functions exactly — acid.open is db.open on the hidden default connection.

acid.Connection(
    source,                          # dir / YAML / list / dict / None
    *,
    workers="auto",
    threads=None,
    mem_per_worker_gb=None,
    inmem_row_limit=50_000_000,
    tmpdir=None,
    progress="auto",
    workers_jemalloc_conf=None,
)

Lifecycle

A Connection is not thread-safe. Don't share one across threads; open one per thread instead.

A Connection can't be pickled. __getstate__ is not defined; attempting to send one across processes raises.

db.close() (or the with exit) tears down the worker pool gracefully and frees the engine. After close, every method on the connection raises ConnectionClosedError, and every Catalog previously returned by db.open does the same on its next materialization. There's no resurrection.
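
The "closed means closed" contract can be pictured with a toy pair of classes. ToyConnection, ToyCatalog, and the local ConnectionClosedError are stand-ins invented for this sketch; they model only the rule that handles die with their connection, not how acid manages its worker pool.

```python
class ConnectionClosedError(Exception):
    """Stand-in for acid's ConnectionClosedError (hypothetical local definition)."""


class ToyConnection:
    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False            # never swallow exceptions from the with body

    def close(self):
        self.closed = True      # in this toy, closing twice is harmless

    def open(self, name):
        self._check()
        return ToyCatalog(self, name)

    def _check(self):
        if self.closed:
            raise ConnectionClosedError("connection is closed")


class ToyCatalog:
    def __init__(self, conn, name):
        self.conn, self.name = conn, name

    def head(self, n):
        self.conn._check()      # a handle is only as alive as its connection
        return f"first {n} rows of {self.name}"
```

A handle obtained inside the with block works there and raises afterwards, so results you need later should be materialized before the block exits.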

In notebooks, prefer the module-level API (which the atexit hook cleans up for you) — or, if you want an explicit isolated connection that lives across cells, hold it open and close it when done:

db = acid.Connection("/data", workers=8)
# ... many cells of EDA ...
db.close()

Reuse one connection across many queries

The worker pool is connection-scoped, and pool spin-up dominates short-query time, so build a connection once and reuse it. The module-level default already does this for you: it's one shared connection across the whole process. With an explicit acid.Connection, keep the single instance open across a session or a batch loop rather than reconstructing it per query. See Performance & parallelism (Laptop → cluster scaling) for the timing details.