Python API¶
The user-facing surface is intentionally small: a module-level API
(acid.init / acid.open / acid.sql / ...) over a process-wide
default connection, three classes (Connection, Catalog, Result),
a namespace of aggregate constructors (acid.agg), the Registry
behind --db / YAML configs, and a typed exception hierarchy.
This page is the canonical reference. Each section opens with a short narrative paragraph and then drops to the generated docstrings — treat the autodoc blocks as the source of truth for exact spellings.
For task-shaped walkthroughs, see the user guide; for installation and environment, the install page.
Module-level API¶
acid is singleton-by-default: the module-level functions operate
on one process-wide default connection, lazily built on first use. You
don't have to create or manage a connection object for the common case.
acid.init(...) is optional: call it to pin the default connection's
configuration (source, workers, memory) before first use.
Re-initializing with the same resolved config is a no-op; re-initializing
with a different config raises ConfigError unless you pass
reuse_existing=True (or call acid.shutdown() first). For an
explicit, fully isolated connection that bypasses this default, see
Connection below.
The full model — lazy first use, the singleton rules, atexit
teardown, and when to reach for an explicit connection — is in the
Connections guide.
Lifecycle and configuration¶
acid.init ¶
init(source: 'Union[str, Path, list, dict, Registry, None]' = None, *, workers: "Union[int, Literal['auto'], None]" = None, threads: Optional[int] = None, mem_per_worker_gb: Optional[float] = None, inmem_row_limit: Optional[int] = None, tmpdir: 'Optional[Union[str, Path]]' = None, workers_jemalloc_conf: Optional[str] = None, ram_budget: 'Union[int, str, None]' = None, config: 'Optional[Union[str, Path]]' = None, reuse_existing: bool = False) -> None
Initialize the process-wide default Connection (optional — the first
acid.open() / acid.sql.query() lazy-inits one with defaults).
Singleton-by-default (the Ray / Spark getOrCreate shape):
- not yet initialized → build and stash it;
- already initialized with the same resolved config → no-op;
- already initialized with a different config → ConfigError, unless reuse_existing=True (then keep the current Connection and warn that the new args were ignored). Call shutdown first to rebuild.
source / workers / threads / … are as documented on
acid.Connection. progress is not an init argument — it's
a display preference, set process-wide via configure (so changing it
never rebuilds the pool). For an explicit, isolated Connection that bypasses
this singleton entirely, construct acid.Connection directly.
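The three singleton rules can be modeled in a few lines of plain Python. This is a toy sketch, not acid's implementation: the module-level `_default` dict stands in for the default Connection, `ConfigError` is redefined locally, and comparing keyword arguments directly is an assumption standing in for whatever "resolved config" comparison acid performs.

```python
import warnings
from typing import Any, Dict, Optional


class ConfigError(Exception):
    """Local stand-in for acid's ConfigError (toy model only)."""


# Stand-in for the process-wide default Connection.
_default: Optional[Dict[str, Any]] = None


def init(*, reuse_existing: bool = False, **config: Any) -> None:
    """Apply the three singleton-by-default rules to a config dict."""
    global _default
    if _default is None:
        _default = dict(config)      # not yet initialized: build and stash
    elif _default == config:
        return                       # same config: no-op
    elif reuse_existing:
        # Keep the current default and warn that the new args were ignored.
        warnings.warn("already initialized; new arguments ignored")
    else:
        raise ConfigError("already initialized with a different config")


def shutdown() -> None:
    """Idempotent teardown; the next init() builds a fresh default."""
    global _default
    _default = None


def is_initialized() -> bool:
    return _default is not None


def current_config() -> Optional[Dict[str, Any]]:
    return None if _default is None else dict(_default)
```

In this model, `init(workers=4)` followed by `init(workers=8)` raises, while `init(workers=8, reuse_existing=True)` keeps `workers=4` and only warns.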
acid.shutdown ¶
Tear down the process-wide default Connection (idempotent). The next
module-level call lazy-inits a fresh default. Does not affect explicit
acid.Connection instances.
acid.is_initialized ¶
Whether a process-wide default Connection currently exists.
acid.configure ¶
Catalogs and queries¶
These delegate to the default connection (building one lazily on first
use). Each mirrors the same-named Connection method documented below.
acid.open ¶
Open a catalog / raw file / in-memory frame on the default Connection.
See acid.Connection.open.
acid.register_catalog ¶
Register a catalog on the default Connection. See
acid.Connection.register_catalog.
acid.register_file ¶
Spill a raw file / in-memory frame to a virtual catalog and register it
under name on the default Connection. See
acid.Connection.register_file.
acid.register_moc ¶
Register a MOC on the default Connection. See
acid.Connection.register_moc.
acid.in_cone ¶
An execution-time cone scope on the default Connection (context manager).
See acid.Connection.in_cone.
acid.list_catalogs ¶
List the catalogs openable on the default Connection — the Python
equivalent of the acid list CLI. Returns a list of
CatalogInfo rows
(name, margins_arcsec, root, shadowed). See
acid.Connection.list_catalogs.
The SQL escape hatch (acid.sql)¶
SQL-string entry points live under the acid.sql submodule (the fluent
Catalog API is the headline surface; SQL is the escape hatch). Each delegates
to the default connection, mirroring the same-named Connection method.
acid.sql.query ¶
Run a SQL query on the default Connection. See acid.Connection.sql.
acid.sql.validate ¶
Parse + analyze a SQL query on the default Connection (no execution).
See acid.Connection.validate.
acid.sql.explain ¶
Explain a SQL query's plan on the default Connection.
See acid.Connection.explain.
Discovering downloadable catalogs (acid.archives)¶
acid.archives groups operations against the download path — the
remote archives acid download fetches from. Unlike the catalog/query
delegates above, it is connection-independent: it reads the download
path from config / ACID_DOWNLOAD_PATH, not a Connection. It is the
Python sibling of the acid search
CLI command.
import acid
# Catalogs whose name contains "gaia", across the whole download path.
for cat in acid.archives.search("gaia"):
    print(cat.name, cat.margins_arcsec, cat.root)
search() returns a list of CatalogInfo. Every occurrence of a
name across the download-path roots is returned; a same-named catalog at
a later root is flagged shadowed=True, mirroring how
acid download <name> resolves first-wins (the un-shadowed entry is the
one that would actually be fetched). pattern filters by case-insensitive
substring; cache="refresh" forces a re-crawl past the ~1-hour remote
listing cache and cache="off" bypasses it entirely; timeout= /
insecure= / workers= tune the remote crawl (the same knobs as the
acid search flags).
No acid.archives.download() yet
Discovery is the Python surface today; there is no
acid.archives.download() and no flat acid.search_downloads. To
fetch a catalog programmatically, shell out to acid download, or
point acid.init(...) at a directory you populated with the CLI.
acid.archives.search ¶
search(pattern: 'Optional[str]' = None, *, cache: str = 'use', timeout: float = 300.0, insecure: bool = False, workers: int = 16) -> 'list[CatalogInfo]'
Catalogs available to download across the configured download path.
Returns a list of acid.tools.download.CatalogInfo
(name, margins_arcsec, root), merged first-wins across the
download-path roots (a name at an earlier root shadows the same name later,
mirroring how acid download <name> resolves). pattern filters by
case-insensitive substring. cache is one of "use" (default —
serve remote listings from the ~1h on-disk cache when fresh),
"refresh" (force a re-crawl and rewrite the cache), or "off"
(neither read nor write it). timeout (seconds per request) /
insecure (skip TLS verification) / workers (crawl parallelism)
tune the remote crawl — the same knobs as the acid search flags.
Example::
    for cat in acid.archives.search("gaia"):
        print(cat.name, cat.margins_arcsec)
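The three cache modes differ only in whether the listing cache is read and whether it is written. The following is a toy model under stated assumptions: `ListingCache`, its in-memory entry, and the injected `crawl` callable are hypothetical (the real cache is on disk), and the 3600-second TTL mirrors the ~1h figure above.

```python
import time
from typing import Callable, List, Optional, Tuple


class ListingCache:
    """Hypothetical in-memory stand-in for the on-disk remote-listing cache."""

    def __init__(self, ttl_s: float = 3600.0) -> None:
        self.ttl_s = ttl_s
        self._entry: Optional[Tuple[float, List[str]]] = None  # (stored_at, listing)

    def listing(
        self,
        crawl: Callable[[], List[str]],
        cache: str = "use",
        now: Optional[float] = None,
    ) -> List[str]:
        if cache not in ("use", "refresh", "off"):
            raise ValueError(f"unknown cache mode: {cache!r}")
        now = time.time() if now is None else now
        if cache == "off":
            return crawl()                 # neither read nor write the cache
        fresh = self._entry is not None and now - self._entry[0] < self.ttl_s
        if cache == "use" and fresh:
            return self._entry[1]          # serve the cached listing
        result = crawl()                   # "refresh", or a stale / missing entry
        self._entry = (now, result)        # rewrite the cache
        return result
```

The `now` parameter exists only so the freshness rule can be exercised without waiting an hour.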
acid.tools.download.CatalogInfo dataclass ¶
One catalog discovered under a search-path root.
The row type returned by catalog discovery: acid list /
acid.Connection.list_catalogs over the catalog path, and
acid search / acid.archives.search over the download path.
root is the root this entry was found at. shadowed is True when an
earlier root also has a catalog of this name — so first-wins resolution
(acid open / acid download) would use the earlier one, not this.
Every occurrence is reported (one CatalogInfo per root that has the
name); the un-shadowed one is what resolves.
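The shadowing rule is easy to state in plain Python. This is a sketch only: `Info` is a hypothetical stand-in for CatalogInfo (with margins_arcsec omitted), and `discover` / `resolve` are illustrative names, not acid functions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Info:
    """Hypothetical stand-in for CatalogInfo."""
    name: str
    root: str
    shadowed: bool


def discover(roots: Dict[str, List[str]]) -> List[Info]:
    """Report every occurrence; flag a name already seen at an earlier root."""
    seen = set()
    rows = []
    for root, names in roots.items():  # dicts keep insertion (search-path) order
        for name in names:
            rows.append(Info(name, root, shadowed=name in seen))
            seen.add(name)
    return rows


def resolve(rows: List[Info], name: str) -> Info:
    """First-wins: the un-shadowed occurrence is the one that resolves."""
    return next(r for r in rows if r.name == name and not r.shadowed)
```

With roots `{"/fast": ["gaia_dr3"], "/archive": ["gaia_dr3", "wise"]}`, both gaia_dr3 rows are reported, the /archive one carries shadowed=True, and `resolve` picks the /fast copy.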
Connection¶
The module-level API above shares one connection per process.
acid.Connection(...) is the explicit-isolation escape hatch:
construct it directly when you need two simultaneous connections, two
configs in one process, or library/test isolation from the process-wide
default. A Connection is an explicit handle that owns a worker
pool, an engine, and a catalog registry. Use it as a context manager so
the pool is torn down deterministically when you exit the with block.
import acid
with acid.Connection("catalogs.yaml", workers=8) as db:
    df = db.open("gaia_dr3").head(10).to_polars()
The first argument is one of: a directory (auto-discovers HATS catalogs
in it), a YAML config (named catalogs and MOCs), or None (relies on
the acid.conf config layer / ACID_PATH environment variable).
A Connection resolves catalog names to on-disk HATS catalogs, hands
out lazy Catalog handles via open(...), executes raw SQL via
sql(...), lets you scope an ad-hoc cone over every query with
in_cone(...), and exposes a few introspection methods (status,
list_catalogs, validate, explain).
A few invariants worth knowing up front:
- A Connection cannot be pickled or shared across processes; open a fresh one in each process.
- The worker pool is started lazily on the first query and reused for the connection's lifetime.
- db.in_cone(...) blocks do not nest: only one cone may be active at a time. Attempting to nest raises ValidationError. The cone is read at execution time, so a Catalog can be built once and run scoped (inside the block) or full-sky (outside) using the same handle.
acid.Connection ¶
Persistent acid execution context.
The explicit-isolation escape hatch (most callers use the module-level
singleton API — acid.init / acid.open / acid.sql.query).
Use as a context manager so the worker pool is torn down
deterministically::
    with acid.Connection("/data/hats") as db:
        gaia = db.open("gaia_dr3")
        result = gaia.head(10).to_astropy()
open ¶
open(name_or_path, *, alias: Optional[str] = None, columns: Optional[list[str]] = None, ra: Optional[str] = None, dec: Optional[str] = None) -> 'Catalog'
Open a catalog and return a Catalog handle.
name_or_path is a HATS catalog (resolved below), a raw data file
(.parquet/.csv/.fits/.arrow/…), or an in-memory table
(ndarray / pandas / polars / pyarrow / astropy). A raw file or in-memory
frame is spilled once to a memory-mapped virtual catalog
(EXTERNAL-SOURCES.md); ra / dec name its coordinate columns and
are required for such a source (the columns are not guessed). They
are ignored for a HATS catalog (its coords come from properties).
HATS resolution order:
- Absolute path or URL → use directly.
- Named entry in the YAML config / explicit register_catalog.
- Basename match against the connection's roots.
register_file ¶
register_file(name: str, source, *, ra: Optional[str] = None, dec: Optional[str] = None) -> 'Catalog'
Spill a raw file (or in-memory frame) to a virtual catalog and
register it under name, so both db.sql("... FROM <name> ...")
and db.open(<name>) resolve it.
This is the registering counterpart to open of a raw file:
db.open("targets.csv") returns a fluent Catalog but does
not put a name in the registry (so an ad-hoc file can't clobber a
configured catalog), which means the SQL escape hatch can't see it.
register_file is the explicit opt-in — the table backing the CLI's
--open NAME=PATH flag. ra / dec name the coordinate columns
and are required (the columns are not guessed).
Returns a Catalog handle for name (like register_catalog).
register_catalog ¶
list_catalogs ¶
List the catalogs openable on this connection — the same set the
acid list CLI prints, as CatalogInfo
rows (name, margins_arcsec, root, shadowed).
Crawls this connection's roots with the shared discovery engine
(search_downloads) — over local
directories, ssh:// hosts, and http(s):// mirrors alike — so a
namespaced catalog surfaces as namespace/child, margin-cache
siblings are attributed to their parent (not listed as catalogs), and a
name occurring at more than one root is flagged shadowed on the
later one (acid open resolves first-wins). A root that can't be
reached is skipped with a UserWarning, never fatal.
Catalogs registered explicitly (register_catalog / a YAML config)
but not surfaced by the crawl are included too (a superset of the CLI,
which has no registry) — so this answers "what can I open by name?".
Opt-in discovery: the crawl is O(roots × subdirs) and can be slow on remote roots.
register_moc ¶
Register a MOC footprint by name.
in_cone ¶
Connection-scoped cone restriction — an execution-time scope.
Applies a circular spatial region to every query — fluent or SQL —
whose execution happens inside the with block. The cone is read
when the query is compiled/run, not when its Catalog was built, so a
query can be constructed once and run under different cones (or none)::
    import astropy.units as u

    q = db.open("gaia").where("phot_g_mean_mag < 18")
    with db.in_cone((180.0, 0.0), radius=2 * u.deg):
        near = q.to_polars()    # scoped to the cone
    allsky = q.to_polars()      # full sky: same query, no cone
Only one cone may be active at a time. Entering a second
in_cone block while one is already on the stack raises
ValueError. The naive "geometric intersection of
nested cones" semantics is not safe to expose: a true
intersection of two non-concentric cones is a lens, not a
cone, and the engine's single-cone filter cannot represent
it. Rather than silently approximate, we reject the nesting
and ask the user to compose into a single cone (or use a MOC
via Catalog.in_region).
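Both properties of the cone scope (one active cone at a time, and lookup at execution time rather than at build time) can be modeled with a small context manager. This is a toy under stated assumptions: `ToyConnection` and `run` are hypothetical, the radius is a plain number of degrees rather than an astropy Quantity, and ValueError follows the docstring above.

```python
from contextlib import contextmanager
from typing import Iterator, Optional, Tuple

Cone = Tuple[float, float, float]  # (ra_deg, dec_deg, radius_deg)


class ToyConnection:
    """Hypothetical model of a connection-scoped, non-nesting cone."""

    def __init__(self) -> None:
        self._cone: Optional[Cone] = None

    @contextmanager
    def in_cone(self, center: Tuple[float, float], radius_deg: float) -> Iterator[None]:
        if self._cone is not None:
            # Two non-concentric cones intersect in a lens, not a cone: reject.
            raise ValueError("a cone is already active; compose a single cone")
        self._cone = (center[0], center[1], radius_deg)
        try:
            yield
        finally:
            self._cone = None  # leaving the block restores full-sky execution

    def run(self, query: str) -> str:
        # The scope is read here, at execution time, not when the query was built.
        return f"{query} [cone={self._cone}]"
```

The same query string run inside and outside the block picks up different scopes, which is the behavior the q / near / allsky example relies on.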
sql ¶
sql(query: str, *, output: Optional[Union[str, Path]] = None, progress: Union[bool, Literal['auto'], None] = None) -> Result
Execute a SQL query and return a Result.
The active cone (set by any enclosing with db.in_cone(...):
block) is read at execution time.
progress overrides the Connection default (§6) for this
call: True / False to force, "auto" to detect
TTY / IPython, None to inherit.
validate ¶
Parse + analyze, no execution. Returns the engine-neutral
operator tree (acid.plan.ops.OpPlan).
explain ¶
Return a human-readable summary of the analyzed
acid.plan.ops.OpPlan (root, joins, projection, aggregation,
ordering, footprint filters). Debugging aid.
The native Polars engine builds per-partition LazyFrames rather than SQL text, so this prints the plan structure rather than a per-partition query string.
Catalog¶
A Catalog is a lazy, immutable query handle. The two-way split
between composition verbs (return new Catalog — cheap, no I/O)
and materialization verbs (run the query — read parquet, launch
workers, return a result) is the single most useful thing to memorize
about the API. The Debug small, run big
guide is the task-shaped version of this distinction.
Composition verbs (lazy)¶
Each returns a new Catalog; the old one is unchanged. Branching is
free.
| Verb | Purpose |
|---|---|
| where(pred) | SQL predicate over row columns. Sticky pre/post position. A post-aggregate where is the HAVING role. |
| select(*cols) | Replace the projection (* until set). Comma-split strings ok. |
| with_columns(name, fn, *, columns, schema, mode) | Add Python-computed column(s) per partition (columns=/schema= required). |
| limit(n) | Lazy LIMIT n (composes further; use head for eager). |
| in_region(r) | MOC restriction: registered name, path, peer Catalog, or mocpy.MOC. |
| crossmatch(other, *, radius, how, maxmatch, dist_col, suffix, nested, order_by) | Spatial XMATCH. how ∈ {inner, left} (join type); maxmatch ∈ {1, -1, N≥2} (multiplicity). radius must be an astropy Quantity (bare float rejected). |
| join(other, *, on, how, suffix, nested, order_by) | Ordinary equi-join on an integer ID column, or a broadcast join against an in-memory frame. how ∈ {inner, left}. |
| group_by(*keys, localized=False) | GROUP BY keys (flat column or aliased expression). localized=True runs an agg.list fold partition-local. |
| aggregate(**named) | Decomposable aggregates from acid.agg; keyword becomes output column name. |
| collect_lists(*cols, order_by=, descending=) | Fold the remaining (single-catalog) columns into per-group lists. |
| count/sum/mean/min/max/std/var(col) | Single-aggregate shortcuts. Global → a scalar; grouped → a chainable Catalog. |
| sort(*keys, descending=, nulls_last=) | ORDER BY. Pair with .limit(K) for top-K. |
Materialization verbs (eager — run the query)¶
Each runs the recorded query end-to-end. The Catalog.to_* methods
are conveniences: they call execute() and convert in one step. Use
.execute() when you want the intermediate Result (to preview with
.show(), stream with .batches(), or write to disk).
| Verb | Returns |
|---|---|
| head(n) | Result (eager LIMIT n) |
| execute() | Result |
| to_pandas() | pandas.DataFrame |
| to_polars() | polars.DataFrame |
| to_arrow() | pyarrow.Table |
| to_astropy() | astropy.table.Table |
| save(path, *, name, overwrite) | A new Catalog. Writes a HATS catalog directory (stays queryable; streaming, any size), registers it under name, returns a fresh handle. Atomic-on-success. A bare name lands under the first writable ACID_PATH root (durably re-openable by name); a single-file extension is a ValidationError pointing at export. |
| export(path, *, format) | pathlib.Path. Gathers the full result in RAM and writes one flat file (csv/parquet/fits, by extension or format=). For results that leave the system; use save for full-sky outputs. |
Inspection (cheap; no parquet I/O)¶
| Method / property | Returns |
|---|---|
| columns | List of output column names (after collision suffixing). |
| alias | The SQL alias under which this catalog appears in compiled queries. |
| describe() | Dict with name / path / row count / partition count / column types / active cone. |
| explain() | Human-readable summary of the analyzed query plan. |
Catalog is hashable and immutable; two handles built from the same
operations compare equal and hash to the same value.
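A frozen dataclass is one way to get this combination of immutability, structural equality, and hashing in plain Python. The sketch below is illustrative only: `ToyCatalog` and its fields are hypothetical and record far less state than a real Catalog.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyCatalog:
    """Hypothetical immutable query handle: composition returns new objects."""
    name: str
    filters: Tuple[str, ...] = ()
    limit_n: Optional[int] = None

    def where(self, pred: str) -> "ToyCatalog":
        # No I/O and no mutation: just a new handle with one more predicate.
        return replace(self, filters=self.filters + (pred,))

    def limit(self, n: int) -> "ToyCatalog":
        return replace(self, limit_n=n)


base = ToyCatalog("gaia_dr3")
bright = base.where("phot_g_mean_mag < 18")   # base is unchanged
top = bright.limit(10)                        # branching from bright is free
```

Because equality and hashing are derived from the recorded operations, two handles built the same way are interchangeable as dict keys or set members.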
acid.Catalog dataclass ¶
A lazy, immutable query handle.
A "concrete" Catalog has no joins, no filters, no projection — referencing one registered catalog with a fixed alias. Composition methods return new Catalogs that build up a query. Materialization methods compile the current state to SQL and run it through the owning Connection.
Equality / hashing are structural and ignore connection liveness so stale Catalogs remain valid dict keys (§3.3).
columns property ¶
List of output column names this Catalog would produce.
For a concrete catalog: the cached TableSpec.column_names.
For a query catalog: a partial-compile of the projection list
(with collision suffixing per §8.1).
describe ¶
Return a dict describing the catalog or query (§3.3).
Reads only cached metadata; no parquet or network I/O.
explain ¶
Return a human-readable summary of the analyzed Plan this
Catalog would hand the engine (root, joins, projection, filters,
cone / MOC scoping, reduce shape).
in_region ¶
Restrict this Catalog by a MOC.
Accepted region shapes:
- Registered catalog name (str, e.g. "object_lc"): uses that catalog's point_map.fits footprint. Mirrors the SQL form IN_MOC(<alias>, '<name>').
- Peer Catalog handle: same as above, looked up via the handle's own TableSpec.path. Saves typing the name twice when the user already has the handle in scope.
- Filesystem path / URL (str or Path) to either a FITS MOC file or a HATS catalog directory.
- Already-built mocpy.MOC or MocSpec.
Resolution order for strings: registered name first (when this
Catalog's Connection has a catalog by that name), then
filesystem path. Cones (SkyCoord + radius) are NOT
accepted here — use with db.in_cone(...): instead (§4.5.1).
where ¶
Add a SQL predicate, placed by composition order.
Placement is structural (_fold_steps): before the first join/Map
it's a scan pre-filter; after, it joins the post-spine chain; after an
.aggregate() it filters the grouped result (the old HAVING) — it
composes over the aggregate output like any other output verb.
select ¶
Replace the projection. Each col is a SQL projection
fragment (a bare column, an aliased expression, or a function
call). Without a .select(), the projection is *. After an
.aggregate() it projects / computes over the aggregate output.
with_columns ¶
with_columns(name, fn, *, columns=None, schema=None, mode: Optional[Literal['numpy', 'polars']] = None) -> 'Catalog'
Add column(s) computed by a Python function, per partition.
name is a str (single-column form — fn returns one
array-like) or a list[str] (multi-column form — fn returns a
dict / tuple / pl.DataFrame of those columns). columns= (the
input columns) and schema= (output dtype(s) — a numpy-style string,
a {name: dtype} dict, or a pa.Schema) are required (no
inference; they may instead ride the callable as acid_columns /
acid_schema / @acid.function). mode defaults to "numpy"
(each input column arrives as an np.ndarray); mode="polars" hands
a pl.Series. Leaving mode unset honors a function's attached
acid_mode; an explicit mode that conflicts with an attached
acid_mode is rejected.
Applied after the spine (post-join) — a .crossmatch() / .join()
after a user function is rejected (operand-subtree placement is a
follow-up); crossmatch first, then add columns. After an .aggregate()
it computes a column over the aggregate output (a post-aggregate Map).
map_partitions ¶
Replace each partition's frame with fn(df) -> pl.DataFrame.
The body receives the partition's pl.DataFrame and returns its
replacement (different rows / schema; the HEALPix partition is
preserved). schema= (the output schema — a {name: dtype} dict or
a pa.Schema) is required; columns= is an optional
projection-narrowing hint (the body still gets every requested column).
A table-form function changes row identity, so a .crossmatch() /
.join() after it is rejected — .save() the result first, then
join the materialized catalog.
group_by ¶
Set the GROUP BY keys for a subsequent aggregate.
Each key is a flat column name or a SQL expression, optionally
aliased ("floor(mag) AS mag_bin") to name the output column.
Group keys appear in the output (keys first), Polars-style.
localized — assert that the keys are localized (every
row sharing a key value lives in one HEALPix partition, the HATS
nested-association layout — e.g. diaSource by diaObjectId). The
aggregate then runs partition-local (phase-1 only, no cross-partition
reduce). This is an opt-in optimization with the same contract as the
nested equi-join: correct iff the assertion holds; a wrong assertion
makes a key that spans partitions appear in multiple output rows with
split lists (so leave it off — the cross-partition default — unless you
know the layout). Currently supports agg.list aggregates only. See
docs/archive/FLUENT-LIST-AGGREGATE.md.
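The cost of a wrong localized assertion shows up clearly in a toy list fold. The sketch below is plain Python, not acid's engine: partitions are lists of (key, value) rows, and `fold_local` / `fold_global` are hypothetical names for the partition-local and cross-partition paths.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[int, float]             # (group key, value)
Group = Tuple[int, List[float]]     # (group key, collected list)


def fold_local(partitions: List[List[Row]]) -> List[Group]:
    """Phase-1 only: each partition emits its own groups, with no reduce."""
    out: List[Group] = []
    for rows in partitions:
        groups: Dict[int, List[float]] = defaultdict(list)
        for key, value in rows:
            groups[key].append(value)
        out.extend(groups.items())
    return out


def fold_global(partitions: List[List[Row]]) -> List[Group]:
    """Cross-partition default: merge the per-partition lists by key."""
    merged: Dict[int, List[float]] = defaultdict(list)
    for key, values in fold_local(partitions):
        merged[key].extend(values)
    return list(merged.items())


localized = [[(1, 0.1), (1, 0.2)], [(2, 0.3)]]   # each key lives in one partition
spanning = [[(1, 0.1)], [(1, 0.2)]]              # key 1 spans two partitions
```

On `localized` the two paths agree. On `spanning`, the partition-local fold returns key 1 twice with split lists, which is exactly the failure mode described above.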
aggregate ¶
Aggregate, naming each output column by keyword.
cat.group_by("band").aggregate(n=agg.count(), mean=agg.mean("mag")).
Each value must be an acid.AggExpr from the acid.agg
constructors. Without a preceding group_by this is a
global aggregate (one output row).
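One common reading of "decomposable" is that an aggregate can be computed as per-partition partial states that are then combined. The sketch below shows that shape for a mean; it is an assumption about the general technique, not a description of how acid.agg is implemented, and all names are hypothetical.

```python
from functools import reduce
from typing import List, Sequence, Tuple

State = Tuple[float, int]  # (running sum, row count)


def partial_mean(values: Sequence[float]) -> State:
    """Phase 1: one small state per partition."""
    return (sum(values), len(values))


def combine(a: State, b: State) -> State:
    """Reduce: merge two partial states."""
    return (a[0] + b[0], a[1] + b[1])


def finalize(state: State) -> float:
    return state[0] / state[1]


def global_mean(partitions: List[Sequence[float]]) -> float:
    return finalize(reduce(combine, map(partial_mean, partitions)))


partitions = [[1.0, 2.0, 3.0], [10.0]]
result = global_mean(partitions)
```

Carrying (sum, count) rather than per-partition means matters: averaging the two partition means here would give 6.0, while the true mean of the four values is 4.0.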
count ¶
Row count — COUNT(*), or COUNT(col) (non-null) when col
is given.
Global (no preceding .group_by(...)) → an int. Grouped → a lazy
Catalog with one row per group and a count (count_<col>) column.
mean ¶
Mean of col. Global → a scalar; grouped → a lazy Catalog with a
mean_<col> column.
std ¶
Population standard deviation of col. Global → a scalar; grouped →
a lazy Catalog with a std_<col> column.
var ¶
Population variance of col. Global → a scalar; grouped → a lazy
Catalog with a var_<col> column.
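"Population" means the divisor is N rather than N - 1. Python's standard library exposes both conventions, which makes the difference easy to check; the snippet uses the statistics module, not acid, and the magnitudes are made-up values.

```python
import math
import statistics

mags = [17.0, 18.0, 19.0, 20.0]  # illustrative values only

population_var = statistics.pvariance(mags)  # divides by N
sample_var = statistics.variance(mags)       # divides by N - 1
population_std = statistics.pstdev(mags)     # sqrt of the population variance
```

If you compare std / var output against a tool that defaults to the sample convention, expect the other tool's numbers to be slightly larger.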
collect_lists ¶
Fold the remaining columns of a single catalog into per-group lists.
Sugar over aggregate with acid.agg.list: after a
group_by, collect each chosen column into a per-group list<T>
named after the column (one output row per group; the group key(s) stay
scalar). The headline db.open("diaSource").group_by("diaObjectId")
.collect_lists(order_by="midpointMjdTai") light-curve shape, without
enumerating agg.list(...) for every column.
*cols are the flat column names to fold (comma-joined strings ok);
omitted folds every column except the group key(s) and the HEALPix
index column. Naming the columns is the narrowing knob — only those
lists are built, so projection pushdown reads only them (+ the key + the
order_by column) from parquet. order_by (a flat column, optional
ASC/DESC) sorts the elements within every list consistently;
descending sets the direction.
Cross-partition by default; pair with group_by(..., localized=True)
to fold partition-locally (no reduce) on a localized key — it
inherits that path's contract and restrictions. Single-catalog only for
now (no preceding crossmatch/join). See docs/archive/COLLECT-LISTS.md.
sort ¶
sort(*keys: str, descending: Union[bool, 'list[bool]', 'tuple[bool, ...]'] = False, nulls_last: Union[bool, 'list[bool]', 'tuple[bool, ...]'] = False) -> 'Catalog'
ORDER BY keys. Pair with limit / head for
top-K, or use after aggregate.
Each key is a flat column name, SQL expression, or projection-
output name. descending / nulls_last accept a scalar
(applied to every key) or a per-key sequence. Replaces any prior
ordering. A standalone sort with no limit is rejected at compile
time (a full global sort is unsupported — add a limit for top-K).
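The scalar-or-per-key convention for descending, combined with a limit for top-K, can be sketched over a list of dicts. This is plain Python under stated assumptions: `per_key` and `top_k` are hypothetical helpers, nulls_last is left out, and raising on a length mismatch is a guess at reasonable behavior.

```python
from typing import Any, Dict, List, Sequence, Union

Row = Dict[str, Any]


def per_key(flag: Union[bool, Sequence[bool]], n_keys: int) -> List[bool]:
    """Broadcast a scalar flag to every key, or validate a per-key sequence."""
    if isinstance(flag, bool):
        return [flag] * n_keys
    flags = list(flag)
    if len(flags) != n_keys:
        raise ValueError("need exactly one flag per sort key")
    return flags


def top_k(
    rows: Sequence[Row],
    keys: Sequence[str],
    k: int,
    descending: Union[bool, Sequence[bool]] = False,
) -> List[Row]:
    flags = per_key(descending, len(keys))
    ordered = list(rows)
    # Stable sorts applied from the last key to the first give a multi-key order.
    for key, desc in reversed(list(zip(keys, flags))):
        ordered.sort(key=lambda row: row[key], reverse=desc)
    return ordered[:k]


rows = [
    {"band": "g", "mag": 18.0},
    {"band": "r", "mag": 17.0},
    {"band": "g", "mag": 16.0},
    {"band": "r", "mag": 19.0},
]
brightest_last = top_k(rows, ["band", "mag"], 3, descending=[False, True])
```

Here band sorts ascending and mag descending, so the first three rows are (g, 18.0), (g, 16.0), (r, 19.0).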
crossmatch ¶
crossmatch(other: 'Catalog', *, radius, how: Literal['inner', 'left'] = 'inner', maxmatch: int = 1, dist_col: Optional[str] = None, suffix: Optional[str] = None, nested: bool = False, order_by: Optional[str] = None) -> 'Catalog'
Spatial crossmatch with other at the given radius.
Two independent axes (decoupled — every combination is expressible):
- how: the join type, either "inner" (default; drop a left row with no match) or "left" (keep it, with NULL right-side columns).
- maxmatch: the match multiplicity, one of 1 (default; the single nearest match per left row), -1 (all matches within the radius, one row per (left, match) pair), or N >= 2 (up to the N nearest matches within the radius). So how="left", maxmatch=-1 keeps every counterpart and the unmatched left rows. maxmatch=0 (and any value < -1) is a ValueError.
radius must be an astropy Quantity with an angular unit
(a bare float is rejected — the arcsec-vs-degree ambiguity).
dist_col — when given, inject the great-circle separation
(arcsec) as a column of that name (off by default, spec §4a).
suffix — override the default _<alias> collision suffix
applied to other's columns that clash with the left side
(spec §4b).
nested — collect each root object's matches into per-row
lists instead of emitting one row per matching pair (Feature B,
"nested catalog"; M1.3). The root's own columns stay scalar; every
matched-side column (and dist_col if set) becomes a list. A
trailing .select(...) lists only the named right-side columns
(the rest are never read — projection pushdown). order_by (a flat
merged column name, optionally "<col> DESC") sorts the elements
within each list. The aggregation is partition-local (grouped on the
root), so it runs phase-1 only — no cross-partition reduce.
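The how and maxmatch axes can be checked against a brute-force reference on a handful of points. This is a sketch, not acid's algorithm: it compares every pair, takes the radius as plain arcseconds instead of an astropy Quantity, and treats the radius boundary as inclusive, which is an assumption.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]  # (ra_deg, dec_deg)


def sep_arcsec(a: Point, b: Point) -> float:
    """Great-circle separation via the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (*a, *b))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600.0


def crossmatch(
    left: Sequence[Point],
    right: Sequence[Point],
    radius_arcsec: float,
    how: str = "inner",
    maxmatch: int = 1,
) -> List[Tuple[int, Optional[int]]]:
    """Return (left_index, right_index) pairs; None marks an unmatched left row."""
    if maxmatch == 0 or maxmatch < -1:
        raise ValueError("maxmatch must be 1, -1, or N >= 2")
    pairs: List[Tuple[int, Optional[int]]] = []
    for i, p in enumerate(left):
        hits = sorted((sep_arcsec(p, q), j) for j, q in enumerate(right))
        hits = [(d, j) for d, j in hits if d <= radius_arcsec]  # inclusive: assumed
        if maxmatch != -1:
            hits = hits[:maxmatch]          # nearest first
        if hits:
            pairs.extend((i, j) for _, j in hits)
        elif how == "left":
            pairs.append((i, None))         # keep the unmatched left row
    return pairs


left = [(10.0, 0.0), (20.0, 0.0)]
right = [(10.0, 1 / 3600), (10.0, 2 / 3600), (50.0, 0.0)]
```

With a 5 arcsec radius, the first left point has two counterparts and the second has none, so the four how / maxmatch combinations all give different outputs.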
join ¶
join(other: 'Catalog', *, on, how: Literal['inner', 'left'] = 'inner', suffix: Optional[str] = None, nested: bool = False, order_by: Optional[str] = None) -> 'Catalog'
Ordinary equi-join on an integer ID column (§4.6.1).
other is another Catalog, or an in-memory frame (polars
/ pandas / numpy-structured / pyarrow / astropy) to broadcast: a flat
id→value lookup whose key-matching rows aren't spatially localized.
The frame is spilled once to a non-spatial virtual catalog (one
memory-mapped Arrow IPC file) and read whole into every worker, then
hash-joined locally on the key (key decision #19). A frame has no
coordinates — it's a join RHS only, never a crossmatch operand —
and nested=True over a frame is not supported yet.
on takes flat (unqualified) column names — the fluent join is
provenance-free (no alias.col refs):
* a bare column name ("diaObjectId") — used on both sides; must
name an integer column of the merged-left frame and of the right
catalog;
* a tuple (left_flat, right_flat) — left_flat names a column
of the computed merged-left frame (so you can pick an already
collision-suffixed key like "id_b"), right_flat a column of
the right catalog (("id_b", "id")).
how is "inner" (default) or "left". Both keys must be
integer-ID columns (§4.6.1); use db.sql(...) for arbitrary joins.
nested — collect each left row's join partners into per-row
lists instead of emitting one row per matching pair (Feature B,
"nested catalog"; the headline object ⋈ source ON objectId
light-curve shape). The left row's own columns stay scalar; every
right-side column becomes a list. A trailing .select(...) lists
only the named right-side columns. order_by (a flat merged column
name, optionally "<col> DESC") sorts the elements within each
list. Correctness precondition: the right catalog must be
localized with this one by the left row's HEALPix pixel (the
HATS nested-association layout — e.g. Rubin object +
objectForcedSource); the aggregation is partition-local (phase-1
only), so a partner that does not land in the left row's partition is
dropped, exactly as the flat .join() would drop it. See
docs/archive/NESTED-EQUI-JOIN.md.
execute ¶
Compile and run; return the Result.
Returned columns carry the flat suffix-named schema the fluent
compiler emits (.columns is the source of truth) — the engine
names outputs directly, so there is no boundary rename.
progress overrides the owning Connection's default rich
rendering (§4.10): True / False to force on / off,
"auto" to detect TTY / IPython, None to inherit.
head ¶
to_astropy ¶
Materialize as an astropy.table.Table (requires astropy).
Converts straight from Arrow — no pandas round-trip — so this works in a pandas-free environment and keeps integer columns integer (nulls become masked, not float-NaN).
export ¶
export(path, *, format: Optional[str] = None, progress: Optional[Union[bool, Literal['auto']]] = None) -> Path
Materialize the query and write it to a single flat file, then
return the written Path.
The export counterpart to save: export writes a result that
is leaving the system (a CSV / parquet / FITS file for another tool),
while save writes a HATS catalog that stays queryable. Sugar for
self.execute(progress=progress).export(path, format=format).
Format resolution: an explicit format= ("parquet" / "csv" /
"fits") wins; otherwise it is inferred from the path extension
(.parquet / .pq → parquet, .csv → csv, .fits / .fit
→ fits). A path with no usable extension and no format=, or an
unrecognized extension, raises ValidationError — export
never writes HATS; use save for that.
Memory contract: the full result is gathered into memory (via
Result.to_arrow) before the single-file write. This is the right
tool for target lists, proposal tables, and paper tables — and the
wrong one for full-sky outputs. For a result too large to hold in RAM,
use save (streaming, partitioned HATS).
See execute for progress semantics.
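The format-resolution rule described above fits in a small function. This is a sketch under stated assumptions: `resolve_format` is a hypothetical helper, ValueError stands in for acid's ValidationError, and rejecting an unknown explicit format is a guess, since the text only says an explicit format wins.

```python
from pathlib import Path
from typing import Optional

_EXT_TO_FORMAT = {
    ".parquet": "parquet",
    ".pq": "parquet",
    ".csv": "csv",
    ".fits": "fits",
    ".fit": "fits",
}
_FORMATS = ("parquet", "csv", "fits")


def resolve_format(path: str, format: Optional[str] = None) -> str:
    """Explicit format= wins; otherwise infer from the path extension."""
    if format is not None:
        if format not in _FORMATS:          # assumed behavior for unknown formats
            raise ValueError(f"unknown format: {format!r}")
        return format
    ext = Path(path).suffix
    if ext not in _EXT_TO_FORMAT:
        # Missing or unrecognized extension: export never falls back to HATS.
        raise ValueError(f"cannot infer a format from {path!r}; pass format=")
    return _EXT_TO_FORMAT[ext]
```

So "targets.pq" resolves to parquet, "targets.dat" with format="csv" resolves to csv, and a bare "targets" is an error.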
save ¶
save(path, *, name: Optional[str] = None, overwrite: bool = False, progress: Optional[Union[bool, Literal['auto']]] = None) -> 'Catalog'
Write a HATS catalog (a directory tree) at path, register it
under name, and return a fresh Catalog handle bound to the
new catalog.
Destination. A bare name (no /, e.g. save("gxt")) joins
the catalog library: it lands under the first writable local
ACID_PATH root, so the name is durably re-openable in a later
session (acid.open("gxt") / ... FROM gxt) with no path
bookkeeping — the same model acid download <name> uses. An
explicit path (./gxt, /data/gxt, ~/gxt) is used verbatim
(cwd-relative). If a bare name already resolves to a catalog at a
different location earlier on ACID_PATH, the save is refused with
RegistryError (writing it would be unreachable-by-name);
overwrite=True does not override this — pick a different name or an
explicit path.
For a single flat file (CSV / parquet / FITS) use export
instead — a single-file extension on save (save("out.csv")) is
a ValidationError (pass a trailing / to force a HATS tree
genuinely named out.csv).
Atomic-on-success: the query writes to a sibling staging
directory; the existing path and the existing registry
entry are only removed once the write completes. If the query
fails partway through, the original target is preserved
intact.
See execute for progress semantics.
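The staging-directory pattern behind "atomic-on-success" can be sketched with the standard library. This is a generic illustration, not acid's code: `atomic_write_dir` and the `write` callback are hypothetical, and registry handling is omitted.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def atomic_write_dir(target: Path, write: Callable[[Path], None]) -> None:
    """Write into a sibling staging directory; swap it in only on success."""
    staging = Path(
        tempfile.mkdtemp(prefix=target.name + ".staging-", dir=target.parent)
    )
    try:
        write(staging)                      # may fail partway through
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)
        raise                               # the original target is untouched
    if target.exists():
        shutil.rmtree(target)               # removed only after a complete write
    os.replace(staging, target)             # same filesystem, so a plain rename
```

Keeping the staging directory next to the target keeps the final step a rename on one filesystem instead of a copy.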
Result¶
Result is the wrapper every materialization call hands back. It
holds its data either in memory (a pyarrow.Table) or on disk (a
directory of per-partition Parquet files, after a spill or a
HATS-style write); the converter and writer methods work transparently
on both.
Result and Catalog share the same converter and writer names —
to_arrow / to_polars / to_pandas / to_astropy,
export(path, format=), and save(path) — so whichever object you
have in hand, the spelling is the same. The tables below show the two
columns side by side.
Conversions to in-memory tables¶
| Target type | On a Catalog | On a Result |
|---|---|---|
| pandas.DataFrame | cat.to_pandas() | r.to_pandas() |
| polars.DataFrame | cat.to_polars() | r.to_polars() |
| pyarrow.Table | cat.to_arrow() | r.to_arrow() |
| astropy.table.Table | cat.to_astropy() | r.to_astropy() |
| list[dict] | n/a | r.to_pylist() |
Writers (single-file or HATS tree)¶
| Target | On a Catalog | On a Result |
|---|---|---|
| HATS catalog tree | cat.save(path, name=...) (streams; registers it on the connection) | n/a (a Result has left the system; write HATS from the Catalog, or Connection.sql(query, output=dir)) |
| Single Parquet file | cat.export(path) (.parquet) | r.export(path) (.parquet) |
| Single CSV file | cat.export(path) (.csv) | r.export(path) (.csv) |
| Single FITS binary table | cat.export(path) (.fits) | r.export(path) (.fits) |
| Explicit format override | cat.export(path, format=...) | r.export(path, format=...) |
The Catalog.save(...) path is the one you almost always want when
the output will feed another ACID query: it writes the HATS tree,
re-registers it under a name, and returns a fresh Catalog pointing
at the saved tree. Result.export is for handing data to a non-ACID
consumer (a plotting script, a colleague's pipeline); a missing or
unrecognized extension raises ValidationError, with a message
pointing at save.
Streaming and previewing¶
| What you want | Call |
|---|---|
| Iterate pyarrow.RecordBatch chunks | r.batches(batch_size=None) |
| Pretty-print the first n rows to stdout | r.show(n=20) |
| Jupyter HTML repr | _repr_html_ (automatic) |
| First n rows as a new Result | r.head(n) |
| Row count, schema, column names | r.num_rows, r.schema, r.column_names |
| One column as pyarrow.ChunkedArray | r.column(name) |
acid.Result dataclass ¶
A materialized query result.
Backing storage is one of:
- an in-memory pa.Table (_table is not None), or
- a directory of per-partition Parquet files (_output_dir is not None), typically a HATS catalog.
Most callers should use .to_arrow(), .to_polars(), or
.to_astropy(); code that treats the result as a pa.Table
works via the passthrough surface (num_rows, column_names,
column(name), to_pylist()).
to_arrow ¶
Return the result as an in-memory pa.Table.
Loads from disk if the result was spilled or written to a HATS output directory.
to_astropy ¶
Convert to an astropy.table.Table (requires astropy).
Built directly from the Arrow table column-by-column — no pandas
round-trip (see acid.api._coerce.arrow_to_astropy).
batches ¶
Iterate over pa.RecordBatch chunks.
For in-memory results, returns the table's existing batches (or
rebatches if batch_size is set). For disk-backed results,
streams Parquet without materializing the union in memory.
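The rebatching behaviour can be pictured as a generator that regroups incoming chunks without ever holding the whole result. In this sketch plain Python lists stand in for pa.RecordBatch objects, and `rebatch` is a hypothetical name, not part of the Result API.

```python
from typing import Iterable, Iterator, List, Optional


def rebatch(batches: Iterable[List], batch_size: Optional[int] = None) -> Iterator[List]:
    """Yield chunks of exactly batch_size rows (the last one may be shorter)."""
    if batch_size is None:
        # No size requested: pass the existing chunks through unchanged.
        yield from batches
        return
    buffer: List = []
    for batch in batches:
        buffer.extend(batch)
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= batch_size:
            yield buffer[:batch_size]
            buffer = buffer[batch_size:]
    if buffer:
        yield buffer


source = [[1, 2, 3], [4], [5, 6]]
unchanged = list(rebatch(source))
pairs = list(rebatch(source, batch_size=2))
```

Because the function is a generator, at most one incoming chunk plus a partial buffer is held at a time, which is the property that matters for a disk-backed result.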
export ¶
Write the result to path as one flat file and return the
written Path.
A Result is already-materialized data that has left the
partitioned system, so it has no save that keeps data in the system. To
write a HATS catalog, use the lazy Catalog.save, or
Connection.sql(query, output=dir) / acid query --output dir for
the SQL surface. export is the terminal that takes data out of the
system, with the same contract as Catalog.export: an explicit format=
("parquet" / "csv" / "fits") wins; otherwise the format is
inferred from the path extension (.parquet / .pq → parquet,
.csv → csv, .fits / .fit → fits). No usable extension, an
unrecognized one, or a non-single-file format raises
ValidationError; export never writes HATS.
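The format-resolution rule above is compact enough to restate as code. This is a sketch under assumptions, not the library's implementation: `infer_export_format` is a hypothetical name, `ValidationError` here is a local stand-in for acid.ValidationError, and the page does not say how extension case is handled, so the sketch matches lowercase extensions only.

```python
from pathlib import Path
from typing import Optional

# Extension map exactly as listed in the text above.
_EXT_TO_FORMAT = {
    ".parquet": "parquet",
    ".pq": "parquet",
    ".csv": "csv",
    ".fits": "fits",
    ".fit": "fits",
}
_SINGLE_FILE_FORMATS = {"parquet", "csv", "fits"}


class ValidationError(ValueError):
    """Local stand-in for acid.ValidationError."""


def infer_export_format(path, format: Optional[str] = None) -> str:
    """Return the single-file format that export(path, format=) would use."""
    if format is not None:
        # An explicit format= wins over the extension.
        if format not in _SINGLE_FILE_FORMATS:
            raise ValidationError(f"{format!r} is not a single-file format")
        return format
    extension = Path(path).suffix
    if extension not in _EXT_TO_FORMAT:
        raise ValidationError(f"cannot infer a format from {str(path)!r}")
    return _EXT_TO_FORMAT[extension]


from_extension = infer_export_format("targets.pq")
overridden = infer_export_format("out.dat", format="csv")
```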
show ¶
Pretty-print the first n rows to stdout.
Terminal-friendly counterpart to _repr_html_ (which
renders in Jupyter). Uses the same Polars-based formatter as
the acid query CLI, so the output looks identical to what
you'd see piping a query through the command line.
width caps the Polars tbl_width_chars setting; default
10_000 means "don't truncate — let the terminal wrap."
Pass None to use Polars's terminal-aware default.
head ¶
Return a new Result containing the first n rows
(after applying whatever ORDER BY the original query had).
__str__ ¶
Render the result as a Polars DataFrame.
print(result) converts the result to Polars and prints the
DataFrame, so the output carries Polars's shape header
((rows, cols)) and its head/tail row truncation with … —
no separate row cap here. This materializes the full result in
memory (via to_polars / to_arrow); for a large
on-disk result, prefer show (first n rows only) or
repr for the terse one-line summary.
Aggregate constructors — acid.agg¶
acid.agg is a namespace of constructor functions for the
decomposable aggregates ACID supports. Each returns an AggExpr
you pass as a keyword argument to Catalog.aggregate(...); the
keyword name becomes the output column name.
from acid import agg

(cat.group_by("band")
    .aggregate(n=agg.count(), mean_mag=agg.mean("mag"))
    .where("n > 100"))
The full set is count, sum, mean, min, max, std,
var, all, any, list. There is no agg.median /
agg.mode — those are non-decomposable and rejected on both the
fluent and SQL surfaces with ValidationError. The escape hatch is
to aggregate decomposably and finish in Polars or pandas
(r.to_polars().group_by(...).agg(pl.col("mag").median())); see
the aggregation guide's "Why no agg.median?" section.
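To see what "decomposable" means in practice, the following standalone sketch (plain Python, made-up values, no ACID calls) computes a mean from per-partition partial states and shows that the same trick fails for a median.

```python
from statistics import median

# Two partitions of the same "mag" column (made-up values).
partitions = [[1.0, 2.0, 9.0], [3.0, 4.0]]

# mean decomposes: each partition emits a small partial state (sum, count) ...
partials = [(sum(p), len(p)) for p in partitions]
# ... and the partial states merge into the exact global answer.
total, count = map(sum, zip(*partials))
merged_mean = total / count

# median does not: combining per-partition medians gives a different
# number from the median over all rows.
median_of_medians = median(median(p) for p in partitions)
true_median = median(v for p in partitions for v in p)
```

Here `merged_mean` is 3.8, the same as the mean over all five rows, while `median_of_medians` is 2.75 against a true median of 3.0. An exact median needs every row in one place, which is why it is left to the in-memory step after materialization.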
acid.AggExpr dataclass ¶
One aggregate in a Catalog.aggregate call.
func is a lowercase name understood by
acid.plan.aggregates.decompose_agg; arg is a flat column
name (or "*" for COUNT(*)); out_name is the output column
name, set by Catalog.aggregate from the keyword argument.
Frozen + hashable so it can sit in the Catalog's structural hash.
Python functions — acid.function¶
@acid.function attaches a UDF's columns / schema / mode
metadata once at the definition site, so they don't have to be passed
at every with_columns / map_partitions call. On a class it
makes a deferred-construction factory for stateful UDFs (a heavy
resource built once per worker, never shipped in the task payload).
The task-shaped walkthrough is in
Python functions on partitions.
import acid

@acid.function(columns=["mag", "err"], schema="f8")
def snr(mag, err):
    return mag / err

cat.with_columns("snr", snr)  # columns=/schema= ride the function
acid.function ¶
Attach UDF metadata to a function, or make a class a deferred factory.
Usable bare (@acid.function) or parameterized
(@acid.function(columns=[...], schema="f8", mode="numpy")). On a plain
function it sets acid_columns / acid_schema / acid_mode (any not
given is left unset) and returns the function. On a class it returns a
factory whose calls produce a deferred, callable _Deferred handle
carrying the same metadata.
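A decorator that works both bare and parameterized usually follows the pattern sketched below. This is a hypothetical stand-in covering only the plain-function case described above (the class / deferred-factory case is not modelled); the attribute names are taken from the docstring as spelled on this page.

```python
from typing import Callable, Optional, Sequence


def function(
    fn: Optional[Callable] = None,
    *,
    columns: Optional[Sequence[str]] = None,
    schema: Optional[str] = None,
    mode: Optional[str] = None,
):
    """Hypothetical stand-in for the function-decorating half of acid.function."""

    def attach(f: Callable) -> Callable:
        # Set only what was given; anything else is left unset.
        for key, value in (("columns", columns), ("schema", schema), ("mode", mode)):
            if value is not None:
                setattr(f, "acid_" + key, value)
        return f

    # Bare use (@function) receives the function directly;
    # parameterized use (@function(...)) returns the real decorator.
    return attach(fn) if fn is not None else attach


@function(columns=["mag", "err"], schema="f8")
def snr(mag, err):
    return mag / err


@function
def identity(x):
    return x
```

The decorated objects remain ordinary callables; the metadata rides along as attributes, so a consumer can read them with `getattr(fn, "acid_columns", None)` instead of requiring the arguments at every call site.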
Registry¶
A Registry is the in-memory map from catalog name → TableSpec
(path, RA / Dec columns, HEALPix order, margin cache, schema, …).
You rarely build one directly — acid.Connection("/path") and
acid.Connection("config.yaml") both produce one for you — but the
class is exposed for callers that want to assemble a registry
programmatically or merge several roots.
acid.Registry ¶
Resolve catalog names to TableSpec, with HATS auto-detection.
Optionally also holds named MOC footprints, registered via
register_moc or a top-level mocs: section in the YAML.
They're looked up by the analyzer when it encounters IN_MOC()
predicates in a query.
from_directory classmethod ¶
Auto-discover HATS catalogs in subdirectories of path.
Each subdirectory containing properties or hats.properties
becomes a table named after the directory. point_map.fits files
are auto-registered as MOCs.
register_moc ¶
register_moc(name: str, source: Union[str, Path, 'mocpy.MOC', 'np.ndarray', 'MocSpec']) -> 'MocSpec'
Register a MOC by name. source is a FITS path, an in-memory
mocpy.MOC, an (N, 2) numpy array of order-29 [lo, hi)
ranges, or an already-built MocSpec.
get_moc ¶
Return the MOC registered as name, lazily falling back to a
registered catalog's point_map.fits footprint when no explicit
registration matches.
Resolution order:
- Explicitly registered MOC named name.
- Registered catalog named name whose <path>/point_map.fits exists; it is auto-loaded and cached under name so subsequent lookups are free.
- Otherwise raise RegistryError naming both attempts.
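The lookup order can be modelled with a toy class. Everything here is illustrative: `MiniRegistry` is not acid.Registry, `RegistryError` is a local stand-in, the "load" just reads raw bytes instead of parsing a MOC, and keeping the auto-loaded footprint in a separate cache dictionary is an assumption of this sketch.

```python
import tempfile
from pathlib import Path


class RegistryError(LookupError):
    """Local stand-in for acid.RegistryError."""


class MiniRegistry:
    """Toy model of the lookup order only; not acid.Registry."""

    def __init__(self):
        self.mocs = {}         # explicitly registered: name -> MOC
        self.catalogs = {}     # registered catalogs: name -> directory
        self._footprints = {}  # cache of auto-loaded footprints
        self.reads = 0         # number of point_map.fits reads

    def get_moc(self, name):
        # 1. An explicit registration wins.
        if name in self.mocs:
            return self.mocs[name]
        # 2. Catalog footprint: loaded once, then served from the cache.
        if name in self._footprints:
            return self._footprints[name]
        root = self.catalogs.get(name)
        if root is not None:
            point_map = Path(root) / "point_map.fits"
            if point_map.exists():
                self.reads += 1
                # Placeholder load: a real registry would parse the file.
                self._footprints[name] = point_map.read_bytes()
                return self._footprints[name]
        # 3. Neither attempt matched.
        raise RegistryError(
            f"{name!r}: no registered MOC and no catalog with point_map.fits"
        )


with tempfile.TemporaryDirectory() as catalog_dir:
    (Path(catalog_dir) / "point_map.fits").write_bytes(b"footprint")
    reg = MiniRegistry()
    reg.catalogs["gxt"] = catalog_dir
    first = reg.get_moc("gxt")    # falls back to the catalog footprint
    second = reg.get_moc("gxt")   # served from the cache, no second read
    reg.mocs["gxt"] = b"explicit"
    third = reg.get_moc("gxt")    # an explicit registration now wins
```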
catalog_footprint ¶
Return the catalog's footprint MOC loaded from point_map.fits,
cached on first access. None when the catalog isn't registered
or has no point_map.fits. Used by the analyzer to scope IN_MOC
predicates to cells where the catalog actually has data —
independent of Registry._mocs so explicit registrations don't
shadow the catalog's own footprint.
has_moc ¶
Cheap pre-execution check: would get_moc(name) succeed?
Stats the catalog's point_map.fits for the auto-resolution
path but doesn't read it.
is_moc_registered ¶
Strict variant of has_moc: returns True only when
name is an explicitly-registered MOC, not when the
catalog-footprint fallback would synthesize one.
Used by the fluent-Catalog compiler hook for idempotent auto-MOC installation — we want to skip re-registration of an identical content-hashed MOC, but not collapse a registered MOC name onto a catalog of the same name.
Errors¶
All ACID-originated exceptions inherit from acid.AcidError, so a
single except acid.AcidError catches every library failure:
import acid

try:
    acid.sql.query("SELECT BOGUS FROM doesnt_exist")
except acid.AcidError as e:
    print("acid said:", e)
The specific subclasses let you handle distinct failure modes separately. Each one is documented at length in the errors reference — that page is the right place to look when you have a specific message in hand and need to know what to type to fix it.
Hierarchy at a glance:
- AcidError: base class. Carries query, span, hint, suggestion.
  - ParseError: SQL the parser can't handle, or an extension (XMATCH, IN_MOC, inline subquery, CTE) with the wrong shape.
  - ValidationError: query parses but isn't a shape ACID can run (unsupported predicate position, non-decomposable aggregate, margin-cache violation, unknown catalog).
  - RegistryError: catalog / MOC registration problems.
  - ExecutionError: per-partition execution failure (corrupt parquet, OOM, disk full). The first failure aborts the whole job.
  - OutputError: output sink failure (write permission, schema mismatch in streamed write).
  - ConnectionClosedError: Connection (or Catalog bound to it) used after Connection.close().
  - ConfigError: acid.conf problems (missing file, parse failure, bad value), or acid.init(...) called with a config that conflicts with an already-initialized default connection.
acid.AcidError ¶
Bases: Exception
Base class for all acid errors.
Subclasses (ParseError, ValidationError etc.) inherit the
same constructor and renderer. Library callers can catch
AcidError to handle every acid-originated failure uniformly.
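Handling a specific failure mode separately comes down to ordering the except clauses from most to least specific. The sketch below uses local stand-in classes so that it runs without ACID installed; only the class names and the subclass relationship are taken from this page, and `handle` / `raiser` are hypothetical helpers.

```python
class AcidError(Exception):
    """Stand-in for acid.AcidError."""


class ValidationError(AcidError):
    """Stand-in for acid.ValidationError."""


class ExecutionError(AcidError):
    """Stand-in for acid.ExecutionError."""


def handle(action):
    """Call action() and report which handler caught the failure."""
    try:
        action()
    except ValidationError as e:
        # The query itself needs fixing, so retrying would not help.
        return f"fix the query: {e}"
    except AcidError as e:
        # Every other acid-originated failure falls through to the base class.
        return f"acid failed: {e}"
    return "ok"


def raiser(exc):
    """Build a zero-argument callable that raises exc."""
    def action():
        raise exc
    return action


outcomes = [
    handle(raiser(ValidationError("non-decomposable aggregate"))),
    handle(raiser(ExecutionError("disk full"))),
    handle(lambda: None),
]
```

Placing the base-class handler first would swallow the ValidationError case, since Python takes the first matching except clause.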
See also¶
- SQL features: what the analyzer accepts and rejects, with the literal error messages.
- Errors reference: every typed exception with the concrete fix.
- CLI: the acid command-line tool.
- Working with results: the Catalog / Result converter story in task-shape.
- Debug small, run big: the composition-vs-materialization split in action.