Python API

The user-facing surface is intentionally small: a module-level API (acid.init / acid.open / acid.sql / ...) over a process-wide default connection, three classes (Connection, Catalog, Result), a namespace of aggregate constructors (acid.agg), the Registry behind --db / YAML configs, and a typed exception hierarchy.

This page is the canonical reference. Each section opens with a short narrative paragraph and then drops to the generated docstrings — treat the autodoc blocks as the source of truth for exact spellings.

For task-shaped walkthroughs, see the user guide; for installation and environment, the install page.


Module-level API

acid is singleton-by-default: the module-level functions operate on one process-wide default connection, lazily built on first use. You don't have to create or manage a connection object for the common case.

import acid

df = acid.open("gaia_dr3").head(10).to_polars()

acid.init(...) is optional: call it to pin the default connection's configuration (source, workers, memory) before first use. Re-initializing with the same resolved config is a no-op; re-initializing with a different config raises ConfigError unless you pass reuse_existing=True (or call acid.shutdown() first). For an explicit, fully isolated connection that bypasses this default, see Connection below.

The full model — lazy first use, the singleton rules, atexit teardown, and when to reach for an explicit connection — is in the Connections guide.

Lifecycle and configuration

acid.init

init(source: 'Union[str, Path, list, dict, Registry, None]' = None, *, workers: "Union[int, Literal['auto'], None]" = None, threads: Optional[int] = None, mem_per_worker_gb: Optional[float] = None, inmem_row_limit: Optional[int] = None, tmpdir: 'Optional[Union[str, Path]]' = None, workers_jemalloc_conf: Optional[str] = None, ram_budget: 'Union[int, str, None]' = None, config: 'Optional[Union[str, Path]]' = None, reuse_existing: bool = False) -> None

Initialize the process-wide default Connection (optional — the first acid.open() / acid.sql.query() lazy-inits one with defaults).

Singleton-by-default (the Ray / Spark getOrCreate shape):

  • not yet initialized → build and stash it;
  • already initialized with the same resolved config → no-op;
  • already initialized with a different config → ConfigError, unless reuse_existing=True (then keep the current Connection and warn that the new args were ignored). Call shutdown first to rebuild.

source / workers / threads / … are as documented on acid.Connection. progress is not an init argument — it's a display preference, set process-wide via configure (so changing it never rebuilds the pool). For an explicit, isolated Connection that bypasses this singleton entirely, construct acid.Connection directly.
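The three singleton cases listed above can be modeled in a few lines of plain Python. This is a minimal sketch of the documented rules, not acid's implementation: ConfigError is a local stand-in, the "connection" is just the stashed config dict, and current_config() is a hypothetical helper added for inspection.

```python
import warnings
from typing import Optional


class ConfigError(Exception):
    """Local stand-in for acid's ConfigError, defined here for the sketch."""


_default: Optional[dict] = None  # resolved config of the modeled default connection


def init(*, reuse_existing: bool = False, **config) -> None:
    """Model the three get-or-create cases; not acid's real implementation."""
    global _default
    if _default is None:
        _default = dict(config)  # not yet initialized: build and stash
    elif _default == config:
        return  # same resolved config: no-op
    elif reuse_existing:
        # keep the current default and warn that the new args were ignored
        warnings.warn("default already initialized; new arguments ignored")
    else:
        raise ConfigError("already initialized with a different config")


def shutdown() -> None:
    """Drop the modeled default so the next init() rebuilds it (idempotent)."""
    global _default
    _default = None


def current_config() -> Optional[dict]:
    """Hypothetical helper: the stashed config, or None when uninitialized."""
    return _default
```

In this model, init(workers=4) followed by init(workers=8) raises ConfigError, while init(workers=8, reuse_existing=True) keeps the workers=4 default and only warns.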

acid.shutdown

shutdown() -> None

Tear down the process-wide default Connection (idempotent). The next module-level call lazy-inits a fresh default. Does not affect explicit acid.Connection instances.

acid.is_initialized

is_initialized() -> bool

Whether a process-wide default Connection currently exists.

acid.configure

configure(*, progress: "Union[bool, Literal['auto'], None]" = None) -> None

Set process-wide display defaults (not worker-pool config — that goes through init). Only progress today: mutates the running default Connection in place if one exists, and seeds the next init. Excluded from the singleton fingerprint, so changing it never rebuilds the pool.

Catalogs and queries

These delegate to the default connection (building one lazily on first use). Each mirrors the same-named Connection method documented below.

acid.open

open(name, *, alias=None, columns=None, ra=None, dec=None)

Open a catalog / raw file / in-memory frame on the default Connection. See acid.Connection.open.

acid.register_catalog

register_catalog(name, **spec_kwargs)

Register a catalog on the default Connection. See acid.Connection.register_catalog.

acid.register_file

register_file(name, source, *, ra=None, dec=None)

Spill a raw file / in-memory frame to a virtual catalog and register it under name on the default Connection. See acid.Connection.register_file.

acid.register_moc

register_moc(name, source)

Register a MOC on the default Connection. See acid.Connection.register_moc.

acid.in_cone

in_cone(center, *, radius)

An execution-time cone scope on the default Connection (context manager). See acid.Connection.in_cone.

acid.list_catalogs

list_catalogs()

List the catalogs openable on the default Connection — the Python equivalent of the acid list CLI. Returns a list of CatalogInfo rows (name, margins_arcsec, root, shadowed). See acid.Connection.list_catalogs.

acid.status

status()

Status of the default Connection. See acid.Connection.status.

The SQL escape hatch (acid.sql)

SQL-string entry points live under the acid.sql submodule (the fluent Catalog API is the headline surface; SQL is the escape hatch). Each delegates to the default connection, mirroring the same-named Connection method.

acid.sql.query

query(query, *, output=None, progress=None)

Run a SQL query on the default Connection. See acid.Connection.sql.

acid.sql.validate

validate(query)

Parse + analyze a SQL query on the default Connection (no execution). See acid.Connection.validate.

acid.sql.explain

explain(query)

Explain a SQL query's plan on the default Connection. See acid.Connection.explain.

Discovering downloadable catalogs (acid.archives)

acid.archives groups operations against the download path — the remote archives acid download fetches from. Unlike the catalog/query delegates above, it is connection-independent: it reads the download path from config / ACID_DOWNLOAD_PATH, not a Connection. It is the Python sibling of the acid search CLI command.

import acid

# Catalogs whose name contains "gaia", across the whole download path.
for cat in acid.archives.search("gaia"):
    print(cat.name, cat.margins_arcsec, cat.root)

search() returns a list of CatalogInfo. Every occurrence of a name across the download-path roots is returned; a same-named catalog at a later root is flagged shadowed=True, mirroring how acid download <name> resolves first-wins (the un-shadowed entry is the one that would actually be fetched). pattern filters by case-insensitive substring; cache="refresh" forces a re-crawl past the ~1-hour remote listing cache and cache="off" bypasses it entirely; timeout= / insecure= / workers= tune the remote crawl (the same knobs as the acid search flags).

No acid.archives.download() yet

Discovery is the Python surface today; there is no acid.archives.download() and no flat acid.search_downloads. To fetch a catalog programmatically, shell out to acid download, or point acid.init(...) at a directory you populated with the CLI.

acid.archives.search

search(pattern: 'Optional[str]' = None, *, cache: str = 'use', timeout: float = 300.0, insecure: bool = False, workers: int = 16) -> 'list[CatalogInfo]'

Catalogs available to download across the configured download path.

Returns a list of acid.tools.download.CatalogInfo (name, margins_arcsec, root), merged first-wins across the download-path roots (a name at an earlier root shadows the same name later, mirroring how acid download <name> resolves). pattern filters by case-insensitive substring. cache is one of "use" (default — serve remote listings from the ~1h on-disk cache when fresh), "refresh" (force a re-crawl and rewrite the cache), or "off" (neither read nor write it). timeout (seconds per request) / insecure (skip TLS verification) / workers (crawl parallelism) tune the remote crawl — the same knobs as the acid search flags.

Example:

for cat in acid.archives.search("gaia"):
    print(cat.name, cat.margins_arcsec)

acid.tools.download.CatalogInfo dataclass

One catalog discovered under a search-path root.

The row type returned by catalog discovery: acid list / acid.Connection.list_catalogs over the catalog path, and acid search / acid.archives.search over the download path. root is the root this entry was found at. shadowed is True when an earlier root also has a catalog of this name — so first-wins resolution (acid open / acid download) would use the earlier one, not this. Every occurrence is reported (one CatalogInfo per root that has the name); the un-shadowed one is what resolves.
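The first-wins shadowing rule can be illustrated without touching a real search path. The sketch below is an assumption-laden model: the CatalogInfo here is a local dataclass that only mirrors the documented field names (the real type of margins_arcsec is not stated on this page, so a tuple is a guess), and flag_shadowed is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogInfo:
    """Local stand-in mirroring the documented fields; not acid's class."""
    name: str
    margins_arcsec: tuple  # assumed type, for illustration only
    root: str
    shadowed: bool = False


def flag_shadowed(found):
    """found: (root, name) pairs in search-path order; earlier roots win."""
    seen = set()
    rows = []
    for root, name in found:
        # every occurrence is reported; later duplicates are flagged
        rows.append(CatalogInfo(name, (), root, shadowed=name in seen))
        seen.add(name)
    return rows


rows = flag_shadowed([
    ("/data/hats", "gaia_dr3"),
    ("/data/hats", "ztf_dr22"),
    ("/mirror/hats", "gaia_dr3"),  # same name at a later root
])
# The un-shadowed entry is the one a by-name lookup would resolve to.
resolved = {r.name: r.root for r in rows if not r.shadowed}
```

Filtering on `not r.shadowed` is the practical pattern: it reduces the full listing to exactly the entries that resolve by name.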


Connection

The module-level API above shares one connection per process. acid.Connection(...) is the explicit-isolation escape hatch: construct it directly when you need two simultaneous connections, two configs in one process, or library/test isolation from the process-wide default. A Connection is an explicit handle that owns a worker pool, an engine, and a catalog registry. Use it as a context manager so the pool is torn down deterministically when you exit the with block.

import acid

with acid.Connection("catalogs.yaml", workers=8) as db:
    df = db.open("gaia_dr3").head(10).to_polars()

The first argument is one of: a directory (auto-discovers HATS catalogs in it), a YAML config (named catalogs and MOCs), or None (relies on the acid.conf config layer / ACID_PATH environment variable).

A Connection resolves catalog names to on-disk HATS catalogs, hands out lazy Catalog handles via open(...), executes raw SQL via sql(...), lets you scope an ad-hoc cone over every query with in_cone(...), and exposes a few introspection methods (status, list_catalogs, validate, explain).

A few invariants worth knowing up front:

  • A Connection cannot be pickled or shared across processes — open a fresh one in each process.
  • The worker pool is started lazily on the first query and reused for the connection's lifetime.
  • db.in_cone(...) blocks do not nest: only one cone may be active at a time. Attempting to nest raises ValueError (the exception named in the in_cone docstring below). The cone is read at execution time, so a Catalog can be built once and run scoped (inside the block) or full-sky (outside) using the same handle.

acid.Connection

Persistent acid execution context.

The explicit-isolation escape hatch (most callers use the module-level singleton API: acid.init / acid.open / acid.sql.query). Use as a context manager so the worker pool is torn down deterministically:

with acid.Connection("/data/hats") as db:
    gaia = db.open("gaia_dr3")
    result = gaia.head(10).to_astropy()

close

close() -> None

Shut down the worker pool and release resources.

open

open(name_or_path, *, alias: Optional[str] = None, columns: Optional[list[str]] = None, ra: Optional[str] = None, dec: Optional[str] = None) -> 'Catalog'

Open a catalog and return a Catalog handle.

name_or_path is a HATS catalog (resolved below), a raw data file (.parquet/.csv/.fits/.arrow/…), or an in-memory table (ndarray / pandas / polars / pyarrow / astropy). A raw file or in-memory frame is spilled once to a memory-mapped virtual catalog (EXTERNAL-SOURCES.md); ra / dec name its coordinate columns and are required for such a source (the columns are not guessed). They are ignored for a HATS catalog (its coords come from properties).

HATS resolution order:

  1. Absolute path or URL → use directly.
  2. Named entry in the YAML config / explicit register_catalog.
  3. Basename match against the connection's roots.
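The resolution order above can be sketched as a small pure function. This is an illustrative model under stated assumptions, not acid's resolver: resolve_hats is a hypothetical name, the registry is a plain dict, each root is given as a (root, child names) pair, and "basename match" is interpreted here as matching a catalog directory name directly under a root.

```python
from typing import Optional


def resolve_hats(name: str, registry: dict, roots: list) -> Optional[str]:
    """Model the three-step order; returns a location or None if unresolved."""
    # 1. Absolute path or URL: use directly.
    if name.startswith("/") or "://" in name:
        return name
    # 2. Named entry (YAML config or explicit registration).
    if name in registry:
        return registry[name]
    # 3. Basename match against the roots, in order (assumed interpretation).
    for root, children in roots:
        if name in children:
            return f"{root.rstrip('/')}/{name}"
    return None


registry = {"gaia": "/archive/gaia_dr3"}
roots = [("/data/hats", ["gaia", "ztf"]), ("/mirror", ["ztf"])]
```

In this model a registered name beats a same-named directory under a root, and an absolute path or URL bypasses both.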

register_file

register_file(name: str, source, *, ra: Optional[str] = None, dec: Optional[str] = None) -> 'Catalog'

Spill a raw file (or in-memory frame) to a virtual catalog and register it under name, so both db.sql("... FROM <name> ...") and db.open(<name>) resolve it.

This is the registering counterpart to open of a raw file: db.open("targets.csv") returns a fluent Catalog but does not put a name in the registry (so an ad-hoc file can't clobber a configured catalog), which means the SQL escape hatch can't see it. register_file is the explicit opt-in — the table backing the CLI's --open NAME=PATH flag. ra / dec name the coordinate columns and are required (the columns are not guessed).

Returns a Catalog handle for name (like register_catalog).

register_catalog

register_catalog(name: str, **spec_kwargs) -> 'Catalog'

Register a catalog explicitly and return a Catalog handle for it (lower-level alternative to open for callers that want to supply full TableSpec kwargs).

Returns a Catalog, not the underlying TableSpec.

list_catalogs

list_catalogs() -> 'list[CatalogInfo]'

List the catalogs openable on this connection — the same set the acid list CLI prints, as CatalogInfo rows (name, margins_arcsec, root, shadowed).

Crawls this connection's roots with the shared discovery engine (search_downloads) — over local directories, ssh:// hosts, and http(s):// mirrors alike — so a namespaced catalog surfaces as namespace/child, margin-cache siblings are attributed to their parent (not listed as catalogs), and a name occurring at more than one root is flagged shadowed on the later one (acid open resolves first-wins). A root that can't be reached is skipped with a UserWarning, never fatal.

Catalogs registered explicitly (register_catalog / a YAML config) but not surfaced by the crawl are included too (a superset of the CLI, which has no registry) — so this answers "what can I open by name?".

Opt-in discovery: the crawl is O(roots × subdirs) and can be slow on remote roots.

register_moc

register_moc(name: str, source: Union[str, Path, object])

Register a MOC footprint by name.

in_cone

in_cone(center: Union['SkyCoord', tuple], *, radius: Union['Quantity', float]) -> Iterator[None]

Connection-scoped cone restriction — an execution-time scope.

Applies a circular spatial region to every query, fluent or SQL, whose execution happens inside the with block. The cone is read when the query is compiled/run, not when its Catalog was built, so a query can be constructed once and run under different cones (or none):

import astropy.units as u

q = db.open("gaia").where("phot_g_mean_mag < 18")
with db.in_cone((180.0, 0.0), radius=2 * u.deg):
    near = q.to_polars()   # scoped to the cone
allsky = q.to_polars()     # full sky: same query, no cone

Only one cone may be active at a time. Entering a second in_cone block while one is already on the stack raises ValueError. The naive "geometric intersection of nested cones" semantics is not safe to expose: a true intersection of two non-concentric cones is a lens, not a cone, and the engine's single-cone filter cannot represent it. Rather than silently approximate, we reject the nesting and ask the user to compose into a single cone (or use a MOC via Catalog.in_region).
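The two properties described here (the cone is read at execution time, and a second cone is rejected rather than intersected) fit a small context-manager pattern. The following is a sketch of that pattern only; ConeScope and run are hypothetical names, plain floats stand in for SkyCoord / Quantity, and nothing here reflects how acid stores its cone internally.

```python
from contextlib import contextmanager


class ConeScope:
    """Toy model of a single-slot, execution-time cone scope."""

    def __init__(self):
        self.active = None  # (center, radius) while a block is open

    @contextmanager
    def in_cone(self, center, *, radius):
        if self.active is not None:
            # reject nesting instead of approximating a lens-shaped intersection
            raise ValueError("in_cone blocks do not nest; compose a single cone")
        self.active = (center, radius)
        try:
            yield
        finally:
            self.active = None  # always cleared, even if the body raises

    def run(self, query):
        # the scope is looked up when the query runs, not when it was built
        return (query, self.active)
```

Because run() reads self.active at call time, the same query object yields a scoped result inside the block and a full-sky result outside it.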

sql

sql(query: str, *, output: Optional[Union[str, Path]] = None, progress: Union[bool, Literal['auto'], None] = None) -> Result

Execute a SQL query and return a Result.

The active cone (set by any enclosing with db.in_cone(...): block) is read at execution time.

progress overrides the Connection default (§6) for this call: True / False to force, "auto" to detect TTY / IPython, None to inherit.

validate

validate(query: str) -> 'OpPlan'

Parse + analyze, no execution. Returns the engine-neutral operator tree (acid.plan.ops.OpPlan).

explain

explain(query: str) -> str

Return a human-readable summary of the analyzed acid.plan.ops.OpPlan (root, joins, projection, aggregation, ordering, footprint filters). Debugging aid.

The native Polars engine builds per-partition LazyFrames rather than SQL text, so this prints the plan structure rather than a per-partition query string.


Catalog

A Catalog is a lazy, immutable query handle. The two-way split between composition verbs (which return a new Catalog: cheap, no I/O) and materialization verbs (which run the query: read parquet, launch workers, return a result) is the single most useful thing to memorize about the API. The Debug small, run big guide is the task-shaped version of this distinction.

Composition verbs (lazy)

Each returns a new Catalog; the old one is unchanged. Branching is free.

| Verb | Purpose |
|---|---|
| where(pred) | SQL predicate over row columns. Sticky pre/post position. A post-aggregate where is the HAVING role. |
| select(*cols) | Replace the projection (* until set). Comma-split strings ok. |
| with_columns(name, fn, *, columns, schema, mode) | Add Python-computed column(s) per partition (columns=/schema= required). |
| limit(n) | Lazy LIMIT n (composes further; use head for eager). |
| in_region(r) | MOC restriction: registered name, path, peer Catalog, or mocpy.MOC. |
| crossmatch(other, *, radius, how, maxmatch, dist_col, suffix, nested, order_by) | Spatial XMATCH. how ∈ {inner, left} (join type); maxmatch ∈ {1, -1, N≥2} (multiplicity). radius must be an astropy Quantity (bare float rejected). |
| join(other, *, on, how, suffix, nested, order_by) | Ordinary equi-join on an integer ID column, or a broadcast join against an in-memory frame. how ∈ {inner, left}. |
| group_by(*keys, localized=False) | GROUP BY keys (flat column or aliased expression). localized=True runs an agg.list fold partition-local. |
| aggregate(**named) | Decomposable aggregates from acid.agg; keyword becomes output column name. |
| collect_lists(*cols, order_by=, descending=) | Fold the remaining (single-catalog) columns into per-group lists. |
| count/sum/mean/min/max/std/var(col) | Single-aggregate shortcuts. Global → a scalar; grouped → a chainable Catalog. |
| sort(*keys, descending=, nulls_last=) | ORDER BY. Pair with .limit(K) for top-K. |

Materialization verbs (eager — run the query)

Each runs the recorded query end-to-end. The Catalog.to_* methods are conveniences: they call execute() and convert in one step. Use .execute() when you want the intermediate Result (to preview with .show(), stream with .batches(), or write to disk).

| Verb | Returns |
|---|---|
| head(n) | Result (eager LIMIT n) |
| execute() | Result |
| to_pandas() | pandas.DataFrame |
| to_polars() | polars.DataFrame |
| to_arrow() | pyarrow.Table |
| to_astropy() | astropy.table.Table |
| save(path, *, name, overwrite) | A new Catalog. Writes a HATS catalog directory (stays queryable; streaming, any size), registers it under name, returns a fresh handle. Atomic-on-success. A bare name lands under the first writable ACID_PATH root (durably re-openable by name); a single-file extension is a ValidationError pointing at export. |
| export(path, *, format) | pathlib.Path. Gathers the full result in RAM and writes one flat file (csv/parquet/fits, by extension or format=). For results that leave the system; use save for full-sky outputs. |

Inspection (cheap; no parquet I/O)

| Method / property | Returns |
|---|---|
| columns | List of output column names (after collision suffixing). |
| alias | The SQL alias under which this catalog appears in compiled queries. |
| describe() | Dict with name / path / row count / partition count / column types / active cone. |
| explain() | Human-readable summary of the analyzed query plan. |

Catalog is hashable and immutable; two handles built from the same operations compare equal and hash to the same value.

acid.Catalog dataclass

A lazy, immutable query handle.

A "concrete" Catalog has no joins, no filters, no projection — referencing one registered catalog with a fixed alias. Composition methods return new Catalogs that build up a query. Materialization methods compile the current state to SQL and run it through the owning Connection.

Equality / hashing are structural and ignore connection liveness so stale Catalogs remain valid dict keys (§3.3).
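Structural equality and hashing of an immutable handle is the behavior a frozen dataclass gives for free, which makes it a convenient way to picture the contract. The sketch below is only an analogy: Handle is a hypothetical class, its steps tuple is an invented representation, and the page does not say that acid.Catalog is built this way.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Handle:
    """Toy immutable query handle: a table name plus recorded steps."""
    table: str
    steps: tuple = ()

    def where(self, predicate: str) -> "Handle":
        # composition returns a new handle; self is left untouched
        return replace(self, steps=self.steps + (("where", predicate),))

    def limit(self, n: int) -> "Handle":
        return replace(self, steps=self.steps + (("limit", n),))


base = Handle("gaia_dr3")
a = base.where("phot_g_mean_mag < 18").limit(10)
b = base.where("phot_g_mean_mag < 18").limit(10)
cache = {a: "cached result"}  # b finds the same entry
```

Two handles built from the same operations compare equal and hash alike, so either one works as a dict key for the other, and branching from base never mutates it.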

alias property

alias: str

The SQL alias under which this catalog appears in compiled queries.

connection property

connection: 'Connection'

The owning Connection. Raises if it has been closed.

columns property

columns: list

List of output column names this Catalog would produce.

For a concrete catalog: the cached TableSpec.column_names. For a query catalog: a partial-compile of the projection list (with collision suffixing per §8.1).

describe

describe() -> dict

Return a dict describing the catalog or query (§3.3).

Reads only cached metadata; no parquet or network I/O.

explain

explain() -> str

Return a human-readable summary of the analyzed Plan this Catalog would hand the engine (root, joins, projection, filters, cone / MOC scoping, reduce shape).

in_region

in_region(region) -> 'Catalog'

Restrict this Catalog by a MOC.

Accepted region shapes:

  • Registered catalog name (str, e.g. "object_lc") — uses that catalog's point_map.fits footprint. Mirrors the SQL form IN_MOC(<alias>, '<name>').
  • Peer Catalog handle — same as above, looked up via the handle's own TableSpec.path. Saves typing the name twice when the user already has the handle in scope.
  • Filesystem path / URL (str or Path) to either a FITS MOC file or a HATS catalog directory.
  • Already-built mocpy.MOC or MocSpec.

Resolution order for strings: registered name first (when this Catalog's Connection has a catalog by that name), then filesystem path. Cones (SkyCoord + radius) are NOT accepted here — use with db.in_cone(...): instead (§4.5.1).

where

where(predicate: str) -> 'Catalog'

Add a SQL predicate, placed by composition order.

Placement is structural (_fold_steps): before the first join/Map it's a scan pre-filter; after, it joins the post-spine chain; after an .aggregate() it filters the grouped result (the old HAVING) — it composes over the aggregate output like any other output verb.

select

select(*cols: str) -> 'Catalog'

Replace the projection. Each col is a SQL projection fragment (a bare column, an aliased expression, or a function call). Without a .select(), the projection is *. After an .aggregate() it projects / computes over the aggregate output.

limit

limit(n: int) -> 'Catalog'

Lazy LIMIT — returns a Catalog. Use head for an eager Result.

with_columns

with_columns(name, fn, *, columns=None, schema=None, mode: Optional[Literal['numpy', 'polars']] = None) -> 'Catalog'

Add column(s) computed by a Python function, per partition.

name is a str (single-column form — fn returns one array-like) or a list[str] (multi-column form — fn returns a dict / tuple / pl.DataFrame of those columns). columns= (the input columns) and schema= (output dtype(s) — a numpy-style string, a {name: dtype} dict, or a pa.Schema) are required (no inference; they may instead ride the callable as acid_columns / acid_schema / @acid.function). mode defaults to "numpy" (each input column arrives as an np.ndarray); mode="polars" hands a pl.Series. Leaving mode unset honors a function's attached acid_mode; an explicit mode that conflicts with an attached acid_mode is rejected.

Applied after the spine (post-join) — a .crossmatch() / .join() after a user function is rejected (operand-subtree placement is a follow-up); crossmatch first, then add columns. After an .aggregate() it computes a column over the aggregate output (a post-aggregate Map).

map_partitions

map_partitions(fn, *, schema=None, columns=None) -> 'Catalog'

Replace each partition's frame with fn(df) -> pl.DataFrame.

The body receives the partition's pl.DataFrame and returns its replacement (different rows / schema; the HEALPix partition is preserved). schema= (the output schema — a {name: dtype} dict or a pa.Schema) is required; columns= is an optional projection-narrowing hint (the body still gets every requested column).

A table-form function changes row identity, so a .crossmatch() / .join() after it is rejected — .save() the result first, then join the materialized catalog.

group_by

group_by(*keys: str, localized: bool = False) -> 'Catalog'

Set the GROUP BY keys for a subsequent aggregate.

Each key is a flat column name or a SQL expression, optionally aliased ("floor(mag) AS mag_bin") to name the output column. Group keys appear in the output (keys first), Polars-style.

localized — assert that the keys are localized (every row sharing a key value lives in one HEALPix partition, the HATS nested-association layout — e.g. diaSource by diaObjectId). The aggregate then runs partition-local (phase-1 only, no cross-partition reduce). This is an opt-in optimization with the same contract as the nested equi-join: correct iff the assertion holds; a wrong assertion makes a key that spans partitions appear in multiple output rows with split lists (so leave it off — the cross-partition default — unless you know the layout). Currently supports agg.list aggregates only. See docs/archive/FLUENT-LIST-AGGREGATE.md.
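The localized contract is easiest to see on toy data. The sketch below models partitions as lists of (key, value) rows and compares a partition-local fold with a cross-partition fold; the function names are hypothetical and the model ignores everything about HEALPix and the real engine.

```python
from collections import defaultdict


def fold_lists(rows):
    """Group (key, value) rows into {key: [values]}, preserving first-seen order."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return dict(groups)


def localized_fold(partitions):
    """Partition-local: fold each partition alone, then concatenate the output rows."""
    out = []
    for rows in partitions:
        out.extend(fold_lists(rows).items())
    return out


def global_fold(partitions):
    """Cross-partition default: fold all rows together, one output row per key."""
    return list(fold_lists([r for p in partitions for r in p]).items())


localized = [[(1, 20.1), (1, 20.3)], [(2, 19.0)]]  # each key lives in one partition
split = [[(1, 20.1)], [(1, 20.3), (2, 19.0)]]      # key 1 spans two partitions
```

On the localized layout both folds agree. On the split layout the partition-local fold emits key 1 twice with partial lists, which is the failure mode the docstring warns about.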

aggregate

aggregate(**named: 'AggExpr') -> 'Catalog'

Aggregate, naming each output column by keyword.

cat.group_by("band").aggregate(n=agg.count(), mean=agg.mean("mag")). Each value must be an acid.AggExpr from the acid.agg constructors. Without a preceding group_by this is a global aggregate (one output row).

count

count(col: Optional[str] = None)

Row count — COUNT(*), or COUNT(col) (non-null) when col is given.

Global (no preceding .group_by(...)) → an int. Grouped → a lazy Catalog with one row per group and a count (count_<col>) column.

sum

sum(col: str)

SUM(col). Global → a scalar; grouped → a lazy Catalog with a sum_<col> column.

mean

mean(col: str)

Mean of col. Global → a scalar; grouped → a lazy Catalog with a mean_<col> column.

min

min(col: str)

MIN(col). Global → a scalar; grouped → a lazy Catalog with a min_<col> column.

max

max(col: str)

MAX(col). Global → a scalar; grouped → a lazy Catalog with a max_<col> column.

std

std(col: str)

Population standard deviation of col. Global → a scalar; grouped → a lazy Catalog with a std_<col> column.

var

var(col: str)

Population variance of col. Global → a scalar; grouped → a lazy Catalog with a var_<col> column.
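"Population" here means the divisor is N rather than N - 1, which matters when comparing against libraries whose defaults differ (pandas and Polars default to the sample form, ddof=1). A quick standard-library check of the difference, using made-up magnitudes:

```python
import statistics

mags = [17.0, 18.0, 19.0, 20.0]  # illustrative values only

pop_var = statistics.pvariance(mags)    # population: divide by N
pop_std = statistics.pstdev(mags)       # square root of the population variance
sample_var = statistics.variance(mags)  # sample: divide by N - 1

print(pop_var, sample_var)
```

For these four values the squared deviations sum to 5.0, so the population variance is 5.0 / 4 = 1.25 and the sample variance is 5.0 / 3.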

collect_lists

collect_lists(*cols: str, order_by: Optional[str] = None, descending: bool = False) -> 'Catalog'

Fold the remaining columns of a single catalog into per-group lists.

Sugar over aggregate with acid.agg.list: after a group_by, collect each chosen column into a per-group list<T> named after the column (one output row per group; the group key(s) stay scalar). This is the headline db.open("diaSource").group_by("diaObjectId").collect_lists(order_by="midpointMjdTai") light-curve shape, without enumerating agg.list(...) for every column.

*cols are the flat column names to fold (comma-joined strings ok); omitted folds every column except the group key(s) and the HEALPix index column. Naming the columns is the narrowing knob — only those lists are built, so projection pushdown reads only them (+ the key + the order_by column) from parquet. order_by (a flat column, optional ASC/DESC) sorts the elements within every list consistently; descending sets the direction.

Cross-partition by default; pair with group_by(..., localized=True) to fold partition-locally (no reduce) on a localized key — it inherits that path's contract and restrictions. Single-catalog only for now (no preceding crossmatch/join). See docs/archive/COLLECT-LISTS.md.

sort

sort(*keys: str, descending: Union[bool, 'list[bool]', 'tuple[bool, ...]'] = False, nulls_last: Union[bool, 'list[bool]', 'tuple[bool, ...]'] = False) -> 'Catalog'

ORDER BY keys. Pair with limit / head for top-K, or use after aggregate.

Each key is a flat column name, SQL expression, or projection-output name. descending / nulls_last accept a scalar (applied to every key) or a per-key sequence. Replaces any prior ordering. A standalone sort with no limit is rejected at compile time (a full global sort is unsupported; add a limit for top-K).
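The scalar-or-sequence rule for descending, combined with a limit, amounts to a multi-key top-K. A minimal in-memory sketch follows; top_k and broadcast are hypothetical helpers, nulls_last is not modeled, and raising on a length mismatch is an assumption rather than documented behavior.

```python
def broadcast(flag, n_keys):
    """A scalar applies to every key; a sequence is taken per key."""
    return [flag] * n_keys if isinstance(flag, bool) else list(flag)


def top_k(rows, keys, k, *, descending=False):
    """Return the first k rows under a multi-key ordering."""
    flags = broadcast(descending, len(keys))
    if len(flags) != len(keys):
        raise ValueError("descending must be a bool or have one entry per key")
    out = list(rows)
    # Stable sorts applied from the last key to the first give a multi-key order.
    for key, desc in reversed(list(zip(keys, flags))):
        out.sort(key=lambda row: row[key], reverse=desc)
    return out[:k]


rows = [
    {"band": "g", "mag": 18.0},
    {"band": "r", "mag": 17.0},
    {"band": "g", "mag": 16.5},
    {"band": "r", "mag": 19.0},
]
faintest_per_order = top_k(rows, ["band", "mag"], 3, descending=[False, True])
```

Here band sorts ascending and mag descending within each band, so the three rows kept are the two g rows (18.0, then 16.5) followed by the r row at 19.0.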

crossmatch

crossmatch(other: 'Catalog', *, radius, how: Literal['inner', 'left'] = 'inner', maxmatch: int = 1, dist_col: Optional[str] = None, suffix: Optional[str] = None, nested: bool = False, order_by: Optional[str] = None) -> 'Catalog'

Spatial crossmatch with other at the given radius.

Two independent axes (decoupled — every combination is expressible):

  • how — the join type: "inner" (default; drop a left row with no match) or "left" (keep it, with NULL right-side columns).
  • maxmatch — the match multiplicity: 1 (default; the single nearest match per left row), -1 (all matches within the radius — one row per (left, match) pair), or N >= 2 (up to the N nearest matches within the radius). So how="left", maxmatch=-1 keeps every counterpart and the unmatched left rows. maxmatch=0 (and any value < -1) is a ValueError.

radius must be an astropy Quantity with an angular unit (a bare float is rejected — the arcsec-vs-degree ambiguity).
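Because how and maxmatch are independent, it can help to see all the combinations on a toy input. The sketch below models only the row-selection semantics described above: match_rows is a hypothetical function, the candidate lists are assumed to be already filtered to the radius, and distances are plain floats in arcsec.

```python
def match_rows(left_ids, candidates, *, how="inner", maxmatch=1):
    """candidates: {left_id: [(right_id, dist_arcsec), ...]} within the radius."""
    if how not in ("inner", "left"):
        raise ValueError("how must be 'inner' or 'left'")
    if maxmatch == 0 or maxmatch < -1:
        raise ValueError("maxmatch must be 1, -1, or N >= 2")
    rows = []
    for lid in left_ids:
        matches = sorted(candidates.get(lid, []), key=lambda m: m[1])  # nearest first
        if maxmatch != -1:
            matches = matches[:maxmatch]  # keep up to the N nearest
        if matches:
            rows.extend((lid, rid, dist) for rid, dist in matches)
        elif how == "left":
            rows.append((lid, None, None))  # unmatched left row, NULL right side
    return rows


left_ids = ["a", "b"]
candidates = {"a": [("x", 0.8), ("y", 0.3), ("z", 0.5)]}  # "b" has no counterpart
```

With this input, the defaults return only the nearest match for "a", while how="left", maxmatch=-1 returns all three matches for "a" plus one unmatched row for "b".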

dist_col — when given, inject the great-circle separation (arcsec) as a column of that name (off by default, spec §4a). suffix — override the default _<alias> collision suffix applied to other's columns that clash with the left side (spec §4b).

nested — collect each root object's matches into per-row lists instead of emitting one row per matching pair (Feature B, "nested catalog"; M1.3). The root's own columns stay scalar; every matched-side column (and dist_col if set) becomes a list. A trailing .select(...) lists only the named right-side columns (the rest are never read — projection pushdown). order_by (a flat merged column name, optionally "<col> DESC") sorts the elements within each list. The aggregation is partition-local (grouped on the root), so it runs phase-1 only — no cross-partition reduce.

join

join(other: 'Catalog', *, on, how: Literal['inner', 'left'] = 'inner', suffix: Optional[str] = None, nested: bool = False, order_by: Optional[str] = None) -> 'Catalog'

Ordinary equi-join on an integer ID column (§4.6.1).

other is another Catalog, or an in-memory frame (polars / pandas / numpy-structured / pyarrow / astropy) to broadcast: a flat id→value lookup whose key-matching rows aren't spatially localized. The frame is spilled once to a non-spatial virtual catalog (one memory-mapped Arrow IPC file) and read whole into every worker, then hash-joined locally on the key (key decision #19). A frame has no coordinates — it's a join RHS only, never a crossmatch operand — and nested=True over a frame is not supported yet.

on takes flat (unqualified) column names; the fluent join is provenance-free (no alias.col refs):

  • a bare column name ("diaObjectId"), used on both sides; it must name an integer column of the merged-left frame and of the right catalog;
  • a tuple (left_flat, right_flat): left_flat names a column of the computed merged-left frame (so you can pick an already collision-suffixed key like "id_b"), right_flat a column of the right catalog (("id_b", "id")).

how is "inner" (default) or "left". Both keys must be integer-ID columns (§4.6.1); use db.sql(...) for arbitrary joins.

nested: collect each left row's join partners into per-row lists instead of emitting one row per matching pair (Feature B, "nested catalog"; the headline object JOIN source ON objectId light-curve shape). The left row's own columns stay scalar; every right-side column becomes a list. A trailing .select(...) lists only the named right-side columns. order_by (a flat merged column name, optionally "<col> DESC") sorts the elements within each list. Correctness precondition: the right catalog must be localized with this one by the left row's HEALPix pixel (the HATS nested-association layout, e.g. Rubin object + objectForcedSource); the aggregation is partition-local (phase-1 only), so a partner that does not land in the left row's partition is dropped, exactly as the flat .join() would drop it. See docs/archive/NESTED-EQUI-JOIN.md.

execute

execute(*, progress: Optional[Union[bool, Literal['auto']]] = None) -> Result

Compile and run; return the Result.

Returned columns carry the flat suffix-named schema the fluent compiler emits (.columns is the source of truth) — the engine names outputs directly, so there is no boundary rename.

progress overrides the owning Connection's default rich rendering (§4.10): True / False to force on / off, "auto" to detect TTY / IPython, None to inherit.

head

head(n: int = 10, *, progress: Optional[Union[bool, Literal['auto']]] = None) -> Result

Eager LIMIT n. Returns a Result.

See execute for progress semantics.

to_astropy

to_astropy(*, progress: Optional[Union[bool, Literal['auto']]] = None) -> 'astropy.table.Table'

Materialize as an astropy.table.Table (requires astropy).

Converts straight from Arrow — no pandas round-trip — so this works in a pandas-free environment and keeps integer columns integer (nulls become masked, not float-NaN).

export

export(path, *, format: Optional[str] = None, progress: Optional[Union[bool, Literal['auto']]] = None) -> Path

Materialize the query and write it to a single flat file, then return the written Path.

The export counterpart to save: export writes a result that is leaving the system (a CSV / parquet / FITS file for another tool), while save writes a HATS catalog that stays queryable. Sugar for self.execute(progress=progress).export(path, format=format).

Format resolution: an explicit format= ("parquet" / "csv" / "fits") wins; otherwise it is inferred from the path extension (.parquet / .pq → parquet, .csv → csv, .fits / .fit → fits). A path with no usable extension and no format=, or an unrecognized extension, raises ValidationError. export never writes HATS; use save for that.
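The resolution rule is easy to state as code. This is a minimal sketch under stated assumptions: resolve_export_format is a hypothetical helper, ValidationError is a local stand-in for acid.ValidationError, and lower-casing the extension is a guess rather than documented behaviour.

```python
from pathlib import Path


class ValidationError(Exception):
    """Stand-in for acid.ValidationError so the sketch runs without ACID installed."""


# Extension table taken from the paragraph above.
_EXTENSION_FORMATS = {
    ".parquet": "parquet",
    ".pq": "parquet",
    ".csv": "csv",
    ".fits": "fits",
    ".fit": "fits",
}
_FORMATS = {"parquet", "csv", "fits"}


def resolve_export_format(path, format=None):
    """Return the single-file format for path, or raise ValidationError."""
    if format is not None:
        if format not in _FORMATS:
            raise ValidationError(f"unsupported export format {format!r}")
        return format  # an explicit format= wins over the extension
    suffix = Path(path).suffix.lower()  # assumption: matching is case-insensitive
    if suffix not in _EXTENSION_FORMATS:
        raise ValidationError(
            f"cannot infer an export format from {str(path)!r}; "
            "pass format= or use save for a HATS catalog"
        )
    return _EXTENSION_FORMATS[suffix]
```

For example, resolve_export_format("targets.dat", format="csv") resolves to csv because the explicit argument takes precedence, while resolve_export_format("targets") raises.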

Memory contract: the full result is gathered into memory (via Result.to_arrow) before the single-file write. This is the right tool for target lists, proposal tables, and paper tables — and the wrong one for full-sky outputs. For a result too large to hold in RAM, use save (streaming, partitioned HATS).

See execute for progress semantics.

save

save(path, *, name: Optional[str] = None, overwrite: bool = False, progress: Optional[Union[bool, Literal['auto']]] = None) -> 'Catalog'

Write a HATS catalog (a directory tree) at path, register it under name, and return a fresh Catalog handle bound to the new catalog.

Destination. A bare name (no /, e.g. save("gxt")) joins the catalog library: it lands under the first writable local ACID_PATH root, so the name is durably re-openable in a later session (acid.open("gxt") / ... FROM gxt) with no path bookkeeping — the same model acid download <name> uses. An explicit path (./gxt, /data/gxt, ~/gxt) is used verbatim (cwd-relative). If a bare name already resolves to a catalog at a different location earlier on ACID_PATH, the save is refused with RegistryError (writing it would be unreachable-by-name); overwrite=True does not override this — pick a different name or an explicit path.

For a single flat file (CSV / parquet / FITS) use export instead — a single-file extension on save (save("out.csv")) is a ValidationError (pass a trailing / to force a HATS tree genuinely named out.csv).

Atomic-on-success: the query writes to a sibling staging directory; the existing path and the existing registry entry are only removed once the write completes. If the query fails partway through, the original target is preserved intact.
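The staging pattern described above can be sketched with the standard library alone. This is an illustration of the general technique, not ACID's writer: atomic_write_tree and write_fn are hypothetical names, and registry bookkeeping is left out.

```python
import os
import shutil
import tempfile
from pathlib import Path


def atomic_write_tree(target, write_fn):
    """Write a directory tree at target so that a failed write leaves any old tree intact."""
    target = Path(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stage in a sibling directory so the final rename stays on one filesystem.
    staging = Path(
        tempfile.mkdtemp(prefix=f".{target.name}.staging-", dir=target.parent)
    )
    try:
        write_fn(staging)  # may raise partway through
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)  # discard the partial output
        raise
    # Only after a complete write: drop the old tree and move the new one in.
    if target.exists():
        shutil.rmtree(target)
    os.replace(staging, target)
    return target
```

Note that this sketch has a short window between removing the old tree and renaming the staged one; it protects against a failing query, not against a crash at that exact moment.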

See execute for progress semantics.


Result

Result is the wrapper every materialization call hands back. It holds its data either in memory (a pyarrow.Table) or on disk (a directory of per-partition Parquet files, after a spill or a HATS-style write); the converter and writer methods work transparently on both.

Result and Catalog share the same converter and export names (to_arrow / to_polars / to_pandas / to_astropy and export(path, format=)), so whichever object you have in hand, the spelling is the same. save(path) is Catalog-only, because a Result has already left the partitioned system. The tables below show the two columns side by side.

Conversions to in-memory tables

| Target type | On a Catalog | On a Result |
| --- | --- | --- |
| pandas.DataFrame | cat.to_pandas() | r.to_pandas() |
| polars.DataFrame | cat.to_polars() | r.to_polars() |
| pyarrow.Table | cat.to_arrow() | r.to_arrow() |
| astropy.table.Table | cat.to_astropy() | r.to_astropy() |
| list[dict] | | r.to_pylist() |

Writers (single-file or HATS tree)

| Target | On a Catalog | On a Result |
| --- | --- | --- |
| HATS catalog tree | cat.save(path, name=...) (streams; registers it on the connection) | Not available (a Result has left the system; write HATS from the Catalog, or Connection.sql(query, output=dir)) |
| Single Parquet file | cat.export(path) (.parquet) | r.export(path) (.parquet) |
| Single CSV file | cat.export(path) (.csv) | r.export(path) (.csv) |
| Single FITS binary table | cat.export(path) (.fits) | r.export(path) (.fits) |
| Explicit format override | cat.export(path, format=...) | r.export(path, format=...) |

The Catalog.save(...) path is the one you almost always want when the output will feed another ACID query: it writes the HATS tree, re-registers it under a name, and returns a fresh Catalog pointing at the saved tree. Result.export is for handing data to a non-ACID consumer (a plotting script, a colleague's pipeline); a missing or unrecognized extension raises ValidationError, with a message pointing at save.

Streaming and previewing

| What you want | Call |
| --- | --- |
| Iterate pyarrow.RecordBatch chunks | r.batches(batch_size=None) |
| Pretty-print the first n rows to stdout | r.show(n=20) |
| Jupyter HTML repr | _repr_html_ (automatic) |
| First n rows as a new Result | r.head(n) |
| Row count, schema, column names | r.num_rows, r.schema, r.column_names |
| One column as pyarrow.ChunkedArray | r.column(name) |

acid.Result dataclass

A materialized query result.

Backing storage is one of
  • in-memory pa.Table (_table is not None), or
  • a directory of per-partition Parquet files (_output_dir is not None), typically a HATS catalog.

Most callers should use .to_arrow(), .to_polars(), or .to_astropy(); code that treats the result as a pa.Table works via the passthrough surface (num_rows, column_names, column(name), to_pylist()).

column

column(name: str) -> pa.ChunkedArray

Return a single column by name as a pa.ChunkedArray.

to_pandas

to_pandas() -> 'pd.DataFrame'

Convert to a pandas DataFrame.

to_pylist

to_pylist() -> list[dict]

Convert to a list of row dicts.

to_arrow

to_arrow() -> pa.Table

Return the result as an in-memory pa.Table.

Loads from disk if the result was spilled or written to a HATS output directory.

to_polars

to_polars()

Convert to a Polars DataFrame (requires polars installed).

to_astropy

to_astropy() -> 'Table'

Convert to an astropy.table.Table (requires astropy).

Built directly from the Arrow table column-by-column — no pandas round-trip (see acid.api._coerce.arrow_to_astropy).

batches

batches(batch_size: Optional[int] = None) -> Iterator[pa.RecordBatch]

Iterate over pa.RecordBatch chunks.

For in-memory results, returns the table's existing batches (or rebatches if batch_size is set). For disk-backed results, streams Parquet without materializing the union in memory.

export

export(path: str | Path, *, format: Optional[str] = None) -> Path

Write the result to path as one flat file and return the written Path.

A Result is already-materialized data (it has left the partitioned system), so it has no stays-in-the-system save: write a HATS catalog with the lazy Catalog.save, or with Connection.sql(query, output=dir) / acid query --output dir on the SQL surface.

export is the leaves-the-system terminal, with the same contract as Catalog.export: an explicit format= ("parquet" / "csv" / "fits") wins; otherwise the format is inferred from the path extension (.parquet / .pq → parquet, .csv → csv, .fits / .fit → fits). No usable extension, an unrecognized one, or a non-single-file format raises ValidationError. export never writes HATS.

show

show(n: int = 20, *, width: Optional[int] = 10000) -> None

Pretty-print the first n rows to stdout.

Terminal-friendly counterpart to _repr_html_ (which renders in Jupyter). Uses the same Polars-based formatter as the acid query CLI, so the output looks identical to what you'd see piping a query through the command line.

width caps the Polars tbl_width_chars setting; default 10_000 means "don't truncate — let the terminal wrap." Pass None to use Polars's terminal-aware default.

head

head(n: int = 10) -> 'Result'

Return a new Result containing the first n rows (after applying whatever ORDER BY the original query had).

__str__

__str__() -> str

Render the result as a Polars DataFrame.

print(result) converts the result to Polars and prints the DataFrame, so the output carries Polars's shape header ((rows, cols)) and its head/tail row truncation with "…"; there is no separate row cap here. This materializes the full result in memory (via to_polars / to_arrow); for a large on-disk result, prefer show (first n rows only) or repr for the terse one-line summary.


Aggregate constructors — acid.agg

acid.agg is a namespace of constructor functions for the decomposable aggregates ACID supports. Each returns an AggExpr you pass as a keyword argument to Catalog.aggregate(...); the keyword name becomes the output column name.

from acid import agg

(cat.group_by("band")
    .aggregate(n=agg.count(), mean_mag=agg.mean("mag"))
    .where("n > 100"))

The full set is count, sum, mean, min, max, std, var, all, any, list. There is no agg.median / agg.mode — those are non-decomposable and rejected on both the fluent and SQL surfaces with ValidationError. The escape hatch is to aggregate decomposably and finish in Polars or pandas (r.to_polars().group_by(...).agg(pl.col("mag").median())); see the aggregation guide's "Why no agg.median?" section.
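A small worked example shows why the line is drawn where it is. It assumes "decomposable" means that each partition can produce a small partial state that is merged afterwards; the helper names and the sample partitions are hypothetical. A mean merges exactly from per-partition (sum, count) pairs, whereas combining per-partition medians does not recover the true median.

```python
from statistics import median


def partial_mean(values):
    """Per-partition state for a mean: (sum, count)."""
    return (sum(values), len(values))


def merge_mean(partials):
    """Combine per-partition (sum, count) states into the global mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Three hypothetical partitions of one column.
partitions = [[1.0, 2.0, 9.0], [3.0], [4.0, 5.0]]
all_values = [v for part in partitions for v in part]

# Decomposable: merging partial states equals the mean over all rows.
mean_from_partials = merge_mean([partial_mean(p) for p in partitions])

# Not decomposable: the median of per-partition medians is a different number.
median_of_medians = median(median(p) for p in partitions)
true_median = median(all_values)

print(mean_from_partials, median_of_medians, true_median)  # 4.0 3.0 3.5
```

The median needs every value in one place, which is why the escape hatch above finishes the computation in Polars or pandas on the materialized result.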

acid.agg module-attribute

agg = _Agg()

acid.AggExpr dataclass

One aggregate in a Catalog.aggregate call.

func is a lowercase name understood by acid.plan.aggregates.decompose_agg; arg is a flat column name (or "*" for COUNT(*)); out_name is the output column name, set by Catalog.aggregate from the keyword argument. Frozen + hashable so it can sit in the Catalog's structural hash.


Python functions — acid.function

@acid.function attaches a UDF's columns / schema / mode metadata once at the definition site, so they don't have to be passed at every with_columns / map_partitions call. On a class it makes a deferred-construction factory for stateful UDFs (a heavy resource built once per worker, never shipped in the task payload). The task-shaped walkthrough is in Python functions on partitions.

import acid

@acid.function(columns=["mag", "err"], schema="f8")
def snr(mag, err):
    return mag / err

cat.with_columns("snr", snr)        # columns=/schema= ride the function

acid.function

function(obj=None, *, columns: Optional[list] = None, schema=None, mode: Optional[str] = None)

Attach UDF metadata to a function, or make a class a deferred factory.

Usable bare (@acid.function) or parameterized (@acid.function(columns=[...], schema="f8", mode="numpy")). On a plain function it sets acid_columns / acid_schema / acid_mode (any not given is left unset) and returns the function. On a class it returns a factory whose calls produce a deferred, callable _Deferred handle carrying the same metadata.
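The bare-or-parameterized behaviour follows a common decorator pattern, sketched below. function_sketch is a hypothetical re-creation for illustration and is not acid.function itself; it covers only the plain-function case and leaves out the deferred class factory.

```python
def function_sketch(obj=None, *, columns=None, schema=None, mode=None):
    """Attach UDF metadata as attributes; usable bare or with keyword arguments."""

    def attach(target):
        # Only set the attributes that were actually given; the rest stay unset.
        for attr, value in (
            ("acid_columns", columns),
            ("acid_schema", schema),
            ("acid_mode", mode),
        ):
            if value is not None:
                setattr(target, attr, value)
        return target

    if obj is None:
        return attach  # parameterized use: @function_sketch(columns=[...])
    return attach(obj)  # bare use: @function_sketch


@function_sketch(columns=["mag", "err"], schema="f8")
def snr(mag, err):
    return mag / err


@function_sketch
def identity(x):
    return x
```

Because the decorator returns the original function, snr(10.0, 2.0) still behaves as ordinary Python; the metadata simply rides along as attributes for whatever reads them later.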


Registry

A Registry is the in-memory map from catalog name → TableSpec (path, RA / Dec columns, HEALPix order, margin cache, schema, …). You rarely build one directly — acid.Connection("/path") and acid.Connection("config.yaml") both produce one for you — but the class is exposed for callers that want to assemble a registry programmatically or merge several roots.

acid.Registry

Resolve catalog names to TableSpec, with HATS auto-detection.

Optionally also holds named MOC footprints, registered via register_moc or a top-level mocs: section in the YAML. They're looked up by the analyzer when it encounters IN_MOC() predicates in a query.

from_directory classmethod

from_directory(path: str | PathLike) -> 'Registry'

Auto-discover HATS catalogs in subdirectories of path.

Each subdirectory containing properties or hats.properties becomes a table named after the directory. point_map.fits files are auto-registered as MOCs.

register_moc

register_moc(name: str, source: Union[str, Path, 'mocpy.MOC', 'np.ndarray', 'MocSpec']) -> 'MocSpec'

Register a MOC by name. source is a FITS path, an in-memory mocpy.MOC, an (N, 2) numpy array of order-29 [lo, hi) ranges, or an already-built MocSpec.

get_moc

get_moc(name: str) -> 'MocSpec'

Return the MOC registered as name, lazily falling back to a registered catalog's point_map.fits footprint when no explicit registration matches.

Resolution order
  1. Explicitly registered MOC named name.
  2. Registered catalog named name whose <path>/point_map.fits exists — auto-loaded and cached under name so subsequent lookups are free.
  3. Otherwise raise RegistryError naming both attempts.
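The three-step order can be sketched with a toy registry. Everything here is illustrative: MiniRegistry, its load_footprint hook, and the separate footprint cache are assumptions, RegistryError is a local stand-in, and MOCs are treated as opaque objects rather than parsed FITS data.

```python
import tempfile
from pathlib import Path


class RegistryError(Exception):
    """Stand-in for acid.RegistryError so the sketch runs without ACID."""


class MiniRegistry:
    """Toy registry: MOCs are opaque objects, catalogs are just directories."""

    def __init__(self, load_footprint=Path.read_bytes):
        self._mocs = {}             # explicitly registered MOCs
        self._catalogs = {}         # catalog name -> root directory
        self._footprint_cache = {}  # footprints loaded through the fallback
        self._load_footprint = load_footprint  # hypothetical loader hook

    def register_moc(self, name, moc):
        self._mocs[name] = moc

    def register_catalog(self, name, root):
        self._catalogs[name] = Path(root)

    def is_moc_registered(self, name):
        return name in self._mocs  # strict: explicit registrations only

    def get_moc(self, name):
        # 1. An explicit registration wins.
        if name in self._mocs:
            return self._mocs[name]
        # 2. Fall back to the same-named catalog's point_map.fits, cached after first load.
        if name in self._footprint_cache:
            return self._footprint_cache[name]
        root = self._catalogs.get(name)
        if root is not None and (root / "point_map.fits").exists():
            moc = self._load_footprint(root / "point_map.fits")
            self._footprint_cache[name] = moc
            return moc
        # 3. Neither attempt matched.
        raise RegistryError(
            f"no MOC named {name!r} and no catalog {name!r} with a point_map.fits"
        )


# A throwaway catalog directory with a placeholder footprint file.
demo_root = Path(tempfile.mkdtemp())
(demo_root / "gaia").mkdir()
(demo_root / "gaia" / "point_map.fits").write_bytes(b"placeholder footprint")

registry = MiniRegistry()
registry.register_catalog("gaia", demo_root / "gaia")
registry.register_moc("survey_area", "explicit MOC object")

footprint = registry.get_moc("gaia")  # served by the catalog-footprint fallback
```

Keeping the fallback cache separate from the explicit registrations is one way to let a strict check such as is_moc_registered stay False for a name that only resolves through the fallback.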

catalog_footprint

catalog_footprint(catalog_name: str) -> 'MocSpec | None'

Return the catalog's footprint MOC loaded from point_map.fits, cached on first access. None when the catalog isn't registered or has no point_map.fits. Used by the analyzer to scope IN_MOC predicates to cells where the catalog actually has data — independent of Registry._mocs so explicit registrations don't shadow the catalog's own footprint.

has_moc

has_moc(name: str) -> bool

Cheap pre-execution check: would get_moc(name) succeed?

Stats the catalog's point_map.fits for the auto-resolution path but doesn't read it.

is_moc_registered

is_moc_registered(name: str) -> bool

Strict variant of has_moc: returns True only when name is an explicitly-registered MOC, not when the catalog-footprint fallback would synthesize one.

Used by the fluent-Catalog compiler hook for idempotent auto-MOC installation — we want to skip re-registration of an identical content-hashed MOC, but not collapse a registered MOC name onto a catalog of the same name.


Errors

All ACID-originated exceptions inherit from acid.AcidError, so a single except acid.AcidError catches every library failure:

import acid

try:
    acid.sql("SELECT BOGUS FROM doesnt_exist")
except acid.AcidError as e:
    print("acid said:", e)

The specific subclasses let you handle distinct failure modes separately. Each one is documented at length in the errors reference — that page is the right place to look when you have a specific message in hand and need to know what to type to fix it.

Hierarchy at a glance:

  • AcidError: base class. Carries query, span, hint, suggestion.
    • ParseError: SQL the parser can't handle, or an extension (XMATCH, IN_MOC, inline subquery, CTE) with the wrong shape.
    • ValidationError: query parses but isn't a shape ACID can run (unsupported predicate position, non-decomposable aggregate, margin-cache violation, unknown catalog).
    • RegistryError: catalog / MOC registration problems.
    • ExecutionError: per-partition execution failure (corrupt parquet, OOM, disk full). The first failure aborts the whole job.
    • OutputError: output sink failure (write permission, schema mismatch in streamed write).
    • ConnectionClosedError: a Connection (or a Catalog bound to it) used after Connection.close().
    • ConfigError: acid.conf problems (missing file, parse failure, bad value), or acid.init(...) called with a config that conflicts with an already-initialized default connection.
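Handling subclasses separately relies on ordinary except ordering: specific classes first, the base class last. The sketch below uses local stand-in classes so it runs without ACID; the constructor keywords mirror the attributes listed above but their exact signature is an assumption, and describe is a hypothetical helper.

```python
class AcidError(Exception):
    """Stand-in base class carrying the context attributes named above."""

    def __init__(self, message, *, query=None, span=None, hint=None, suggestion=None):
        super().__init__(message)
        self.query = query
        self.span = span
        self.hint = hint
        self.suggestion = suggestion


class ParseError(AcidError):
    pass


class ValidationError(AcidError):
    pass


class RegistryError(AcidError):
    pass


def describe(exc):
    """Route an error to a message, most specific handler first."""
    try:
        raise exc
    except ParseError as e:
        return f"syntax: {e}"
    except ValidationError as e:
        return f"unsupported query: {e} (hint: {e.hint})"
    except AcidError as e:
        # Catch-all for every other library failure, e.g. RegistryError.
        return f"acid said: {e}"
```

Putting the AcidError clause first would swallow every subclass, so the base-class handler belongs at the end.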

acid.AcidError

Bases: Exception

Base class for all acid errors.

Subclasses (ParseError, ValidationError etc.) inherit the same constructor and renderer. Library callers can catch AcidError to handle every acid-originated failure uniformly.


See also