Working with results and exporting¶
Every materialized query in ACID — whether you called cat.head(10),
cat.execute(), or acid.sql.query("...") —
hands you back a Result. This page covers what Result is, how to
turn it into the kind of object your analysis code wants
(pandas.DataFrame, polars.DataFrame, astropy.table.Table,
pa.Table), how to stream it when it's too big to fit in memory, and
how to write it to disk.
It also covers the units the data comes back in — important enough that the rest of the page assumes you've read the Output units section first.
Result vs. Catalog — which has which converter?¶
ACID's two main user-facing types share the same converter names, so you don't need to remember which noun you're holding:
| Want | Call on Catalog (composition handle) |
Call on Result (materialized) |
|---|---|---|
pandas.DataFrame |
cat.to_pandas() |
r.to_pandas() |
polars.DataFrame |
cat.to_polars() |
r.to_polars() |
pa.Table |
cat.to_arrow() |
r.to_arrow() |
astropy.table.Table |
cat.to_astropy() |
r.to_astropy() |
| list of row dicts | — | r.to_pylist() |
The Catalog.to_* methods are a convenience: they .execute() the
query and convert the resulting Result for you in one call. If you
already have a Result in hand (from acid.sql.query(...) or from
cat.execute()), the per-type calls above are equivalent.
The most common path¶
import acid
acid.init("/data/hats")
gaia = acid.open("gaia_dr3")
df = (gaia
.where("phot_g_mean_mag < 16")
.select("source_id, ra, dec, phot_g_mean_mag")
.to_polars()) # Catalog.to_polars — one call
The same query, two-step:
r = (gaia
.where("phot_g_mean_mag < 16")
.select("source_id, ra, dec, phot_g_mean_mag")
.execute()) # -> Result
df = r.to_pandas() # convert when you need to
Use the two-step shape when you want to do something with the Result
itself first — r.show() to preview, r.export(...) to write a flat file,
or r.batches() to stream.
Output units¶
ACID does no unit conversion. Every column comes back in whatever units the source catalog stored it in:
ra,dec— degrees, ICRS. (Both reading and writing.)- Magnitudes, fluxes, errors, IDs — exactly what the source catalog stored.
dist_col— theXMATCH(..., dist_col => 'd')/ fluentcrossmatch(..., dist_col="d")separation column is great-circle separation in arcseconds,float64, and is NULL on unmatched LEFT rows.
If your downstream code expects something else (degrees, radians,
mJy, ergs), do the conversion at the boundary — either in SQL
(SELECT d / 3600.0 AS d_deg), in a Catalog.select(...) expression,
or after .to_pandas() / .to_polars().
All RA/Dec are J2000 — no epoch propagation, ever
acid treats every catalog's stored RA/Dec as J2000 (ICRS).
There is no proper-motion handling, no parallax, no epoch
propagation. If you registered a Gaia (J2016) catalog and matched
it against a J2000 survey, high-proper-motion sources will be
off the match radius — the matcher does not know to bring them
forward. See crossmatch: epoch — all RA/Dec are J2000
for the workaround (propagate to J2000 before registering).
Why no auto-conversion?
Two reasons. (1) Survey conventions disagree (mag vs. flux,
AB vs. Vega, mJy vs. nJy) and ACID can't guess yours from
column names. (2) Pass-through keeps round-trips through .save()
lossless — what you read is what you wrote.
Pretty-printing and the REPL¶
.show() prints the first 20 rows with a fixed-width Polars table
formatter — the same renderer the acid CLI uses, so notebook output
matches what you'd see piping acid query "...".
>>> r = gaia.where("phot_g_mean_mag < 16").head(5).execute()
>>> r.show()
shape: (5, 4)
┌──────────────────────┬──────────┬─────────┬──────────────────┐
│ source_id ┆ ra ┆ dec ┆ phot_g_mean_mag │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 │
╞══════════════════════╪══════════╪═════════╪══════════════════╡
│ 4295806720 ┆ 44.99632 ┆ 0.00514 ┆ 15.87 │
...
print(r) (Result.__str__) instead renders the result as a Polars
DataFrame — its shape: (rows, cols) header plus Polars's own head/tail
row truncation (it materializes the full result; for a large on-disk
result prefer r.show(n), which reads only the first n rows). Dropping a
bare r in a Jupyter cell renders an HTML table (via _repr_html_). For
terse one-line debugging, repr(r) returns a single-line summary
(<acid.Result rows=… cols=[…] (in-memory)>).
r.head(n) returns a new Result containing the first n rows
(streams from disk if needed). It does not modify r and is cheap
to call repeatedly.
Converters in detail¶
r.to_pandas()¶
Materialize as a pandas.DataFrame. Loads the full result into
memory; for multi-million-row results you usually want to_polars()
instead.
r.to_polars()¶
Materialize as a polars.DataFrame. Requires polars installed
(it already is — ACID's engine uses it).
Want it faster?
to_pandas() is the obvious default for astronomers. For anything
heavier than a quick .head() look on multi-million-row results
— group-by, filtering, joins — r.to_polars() is typically
5–50× faster than pandas. You can always come back with
polars_df.to_pandas().
r.to_arrow()¶
Return the underlying pyarrow.Table. If the result was spilled to
disk, this loads it back into memory first.
r.to_astropy()¶
Materialize as an astropy.table.Table — the natural catalog type
(units, coords, FITS-friendly columns). Same conversion as
Catalog.to_astropy().
r.to_pylist()¶
A list of row dicts. Useful for small results being fed into other APIs (logging, JSON serialization). Avoid on anything over a few thousand rows — pandas / Polars are vastly faster for tabular work.
Streaming when the result is too big¶
Some results don't fit in memory: a full-sky crossmatch, a tall projection. Two mechanisms handle this without OOM-ing the parent:
Automatic spill (the default)¶
When the running result exceeds inmem_row_limit (default 50M rows),
ACID spills phase-1 partials to a per-connection tempdir instead of
collecting them all in RAM. The Result you get back is then
disk-backed — r.to_arrow() loads it lazily.
You can change the threshold when you initialize:
…or per environment (ACID_INMEM_ROW_LIMIT=200_000_000).
r.batches() — stream record batches¶
When you'd rather process the result in chunks than load it whole,
r.batches() yields pa.RecordBatch objects. Works for both
in-memory and disk-backed results:
total = 0
for batch in r.batches(batch_size=100_000):
# `batch` is a pa.RecordBatch — process or write it.
total += batch.num_rows
Iterating a Result is the same as iterating r.batches() with
default chunking:
To disk: save vs export¶
A query usually ends in one of two terminal verbs on the Catalog — no
.execute() hop needed. Pick by where the result is going:
q = acid.open("gaia").where("phot_g_mean_mag < 18")
q.save("bright_gaia") # stays queryable — a HATS catalog
q.export("bright_gaia.csv") # leaves as one file — CSV / parquet / FITS
save = stays queryable (streaming, any size). Writes a HATS catalog
directory and returns a Catalog handle to it. It streams partition by
partition, so it scales to full-sky outputs. A bare name (no /) joins
your catalog library — it lands under your first writable ACID_PATH root,
so a later session re-opens it by name with no path bookkeeping:
acid.open("gaia").where("phot_g_mean_mag < 18").save("bright_gaia")
# ... new session, same machine ...
acid.open("bright_gaia") # resolved by name, no path needed
An explicit path (./out, /data/out, ~/out) is used verbatim instead.
export = leaves as one file (in-memory, modest sizes). Materializes the
full result in memory, then writes a single flat file; returns the written
Path. The format comes from the extension (.csv / .parquet / .pq /
.fits / .fit) or an explicit format=. This is the right tool for target
lists, proposal tables, and paper tables.
export holds the whole result in RAM
Unlike save (which streams), export gathers every row before writing
the single file. That's fine for a selective query, but scaling the same
pipeline up to a full-sky sample can OOM your laptop. For large outputs
use save (streaming, partitioned HATS) and read it back with
acid.open(...).
save("out.csv") is a deliberate error (it points you at export); pass a
trailing slash — save("out.csv/") — only if you genuinely want a HATS
directory named out.csv. export never writes HATS — use save for that.
Writing to disk¶
A Result is already-materialized data — it has left the partitioned
system — so its only file output is export, a single flat file:
r.export("out.parquet") # -> single Parquet file
r.export("out.csv") # -> single CSV
r.export("out.fits") # -> single FITS binary table
export(path) infers the format from the extension (.parquet / .pq →
parquet, .csv → csv, .fits / .fit → fits); pass format= explicitly
to override. A missing or unrecognized extension raises ValidationError —
export never writes HATS.
HATS output stays in the system — write it from a Catalog
There is no Result.save. A partitioned HATS catalog is written by the
lazy, still-in-the-system handle, which streams it correctly (any size)
and keeps its _healpix_29 spatial index: use
Catalog.save(path, name="...") for the registered EDA pattern, or
acid.sql.query(query, output="dir/") / acid query --output dir/ for
the SQL surface. See Catalogs and the registry.
Quick sanity checks¶
| You want | Call |
|---|---|
| Row count | r.num_rows or len(r) |
| Column names | r.column_names |
| Schema (with types) | r.schema |
| A single column (no copy) | r.column("ra") |
| First 10 rows printed | r.head(10).show() |
| First 10 rows as DataFrame | r.head(10).to_pandas() |