Skip to content

Working with results and exporting

Every materialized query in ACID — whether you called cat.head(10), cat.execute(), or acid.sql.query("...") — hands you back a Result. This page covers what Result is, how to turn it into the kind of object your analysis code wants (pandas.DataFrame, polars.DataFrame, astropy.table.Table, pa.Table), how to stream it when it's too big to fit in memory, and how to write it to disk.

It also covers the units the data comes back in — important enough that the rest of the page assumes you've read the Output units section first.

Result vs. Catalog — which has which converter?

ACID's two main user-facing types share the same converter names, so you don't need to remember which noun you're holding:

Want Call on Catalog (composition handle) Call on Result (materialized)
pandas.DataFrame cat.to_pandas() r.to_pandas()
polars.DataFrame cat.to_polars() r.to_polars()
pa.Table cat.to_arrow() r.to_arrow()
astropy.table.Table cat.to_astropy() r.to_astropy()
list of row dicts r.to_pylist()

The Catalog.to_* methods are a convenience: they .execute() the query and convert the resulting Result for you in one call. If you already have a Result in hand (from acid.sql.query(...) or from cat.execute()), the per-type calls above are equivalent.

The most common path

import acid

acid.init("/data/hats")
gaia = acid.open("gaia_dr3")
df = (gaia
      .where("phot_g_mean_mag < 16")
      .select("source_id, ra, dec, phot_g_mean_mag")
      .to_polars())                  # Catalog.to_polars — one call

The same query, two-step:

r = (gaia
     .where("phot_g_mean_mag < 16")
     .select("source_id, ra, dec, phot_g_mean_mag")
     .execute())                       # -> Result

df = r.to_pandas()                     # convert when you need to

Use the two-step shape when you want to do something with the Result itself first — r.show() to preview, r.export(...) to write a flat file, or r.batches() to stream.

Output units

ACID does no unit conversion. Every column comes back in whatever units the source catalog stored it in:

  • ra, dec — degrees, ICRS. (Both reading and writing.)
  • Magnitudes, fluxes, errors, IDs — exactly what the source catalog stored.
  • dist_col — the XMATCH(..., dist_col => 'd') / fluent crossmatch(..., dist_col="d") separation column is great-circle separation in arcseconds, float64, and is NULL on unmatched LEFT rows.

If your downstream code expects something else (degrees, radians, mJy, ergs), do the conversion at the boundary — either in SQL (SELECT d / 3600.0 AS d_deg), in a Catalog.select(...) expression, or after .to_pandas() / .to_polars().

All RA/Dec are J2000 — no epoch propagation, ever

acid treats every catalog's stored RA/Dec as J2000 (ICRS). There is no proper-motion handling, no parallax, no epoch propagation. If you registered a Gaia (J2016) catalog and matched it against a J2000 survey, high-proper-motion sources will be off the match radius — the matcher does not know to bring them forward. See crossmatch: epoch — all RA/Dec are J2000 for the workaround (propagate to J2000 before registering).

Why no auto-conversion?

Two reasons. (1) Survey conventions disagree (mag vs. flux, AB vs. Vega, mJy vs. nJy) and ACID can't guess yours from column names. (2) Pass-through keeps round-trips through .save() lossless — what you read is what you wrote.

Pretty-printing and the REPL

.show() prints the first 20 rows with a fixed-width Polars table formatter — the same renderer the acid CLI uses, so notebook output matches what you'd see piping acid query "...".

>>> r = gaia.where("phot_g_mean_mag < 16").head(5).execute()
>>> r.show()
shape: (5, 4)
┌──────────────────────┬──────────┬─────────┬──────────────────┐
│ source_id            ┆ ra       ┆ dec     ┆ phot_g_mean_mag  │
│ ---                  ┆ ---      ┆ ---     ┆ ---              │
│ i64                  ┆ f64      ┆ f64     ┆ f64              │
╞══════════════════════╪══════════╪═════════╪══════════════════╡
│ 4295806720           ┆ 44.99632 ┆ 0.00514 ┆ 15.87            │
...

print(r) (Result.__str__) instead renders the result as a Polars DataFrame — its shape: (rows, cols) header plus Polars's own head/tail row truncation (it materializes the full result; for a large on-disk result prefer r.show(n), which reads only the first n rows). Dropping a bare r in a Jupyter cell renders an HTML table (via _repr_html_). For terse one-line debugging, repr(r) returns a single-line summary (<acid.Result rows=… cols=[…] (in-memory)>).

r.head(n) returns a new Result containing the first n rows (streams from disk if needed). It does not modify r and is cheap to call repeatedly.

Converters in detail

r.to_pandas()

Materialize as a pandas.DataFrame. Loads the full result into memory; for multi-million-row results you usually want to_polars() instead.

df = r.to_pandas()

r.to_polars()

Materialize as a polars.DataFrame. Requires polars installed (it already is — ACID's engine uses it).

pdf = r.to_polars()
agg = pdf.group_by("band").agg(pl.col("mag").mean())

Want it faster?

to_pandas() is the obvious default for astronomers. For anything heavier than a quick .head() look on multi-million-row results — group-by, filtering, joins — r.to_polars() is typically 5–50× faster than pandas. You can always come back with polars_df.to_pandas().

r.to_arrow()

Return the underlying pyarrow.Table. If the result was spilled to disk, this loads it back into memory first.

table = r.to_arrow()       # pa.Table

r.to_astropy()

Materialize as an astropy.table.Table — the natural catalog type (units, coords, FITS-friendly columns). Same conversion as Catalog.to_astropy().

tbl = r.to_astropy()

r.to_pylist()

A list of row dicts. Useful for small results being fed into other APIs (logging, JSON serialization). Avoid on anything over a few thousand rows — pandas / Polars are vastly faster for tabular work.

Streaming when the result is too big

Some results don't fit in memory: a full-sky crossmatch, a tall projection. Two mechanisms handle this without OOM-ing the parent:

Automatic spill (the default)

When the running result exceeds inmem_row_limit (default 50M rows), ACID spills phase-1 partials to a per-connection tempdir instead of collecting them all in RAM. The Result you get back is then disk-backedr.to_arrow() loads it lazily.

You can change the threshold when you initialize:

acid.init("/data/hats", inmem_row_limit=200_000_000)

…or per environment (ACID_INMEM_ROW_LIMIT=200_000_000).

r.batches() — stream record batches

When you'd rather process the result in chunks than load it whole, r.batches() yields pa.RecordBatch objects. Works for both in-memory and disk-backed results:

total = 0
for batch in r.batches(batch_size=100_000):
    # `batch` is a pa.RecordBatch — process or write it.
    total += batch.num_rows

Iterating a Result is the same as iterating r.batches() with default chunking:

for batch in r:
    ...

To disk: save vs export

A query usually ends in one of two terminal verbs on the Catalog — no .execute() hop needed. Pick by where the result is going:

q = acid.open("gaia").where("phot_g_mean_mag < 18")

q.save("bright_gaia")          # stays queryable — a HATS catalog
q.export("bright_gaia.csv")    # leaves as one file — CSV / parquet / FITS

save = stays queryable (streaming, any size). Writes a HATS catalog directory and returns a Catalog handle to it. It streams partition by partition, so it scales to full-sky outputs. A bare name (no /) joins your catalog library — it lands under your first writable ACID_PATH root, so a later session re-opens it by name with no path bookkeeping:

acid.open("gaia").where("phot_g_mean_mag < 18").save("bright_gaia")
# ... new session, same machine ...
acid.open("bright_gaia")       # resolved by name, no path needed

An explicit path (./out, /data/out, ~/out) is used verbatim instead.

export = leaves as one file (in-memory, modest sizes). Materializes the full result in memory, then writes a single flat file; returns the written Path. The format comes from the extension (.csv / .parquet / .pq / .fits / .fit) or an explicit format=. This is the right tool for target lists, proposal tables, and paper tables.

export holds the whole result in RAM

Unlike save (which streams), export gathers every row before writing the single file. That's fine for a selective query, but scaling the same pipeline up to a full-sky sample can OOM your laptop. For large outputs use save (streaming, partitioned HATS) and read it back with acid.open(...).

save("out.csv") is a deliberate error (it points you at export); pass a trailing slash — save("out.csv/") — only if you genuinely want a HATS directory named out.csv. export never writes HATS — use save for that.

Writing to disk

A Result is already-materialized data — it has left the partitioned system — so its only file output is export, a single flat file:

r.export("out.parquet")    # -> single Parquet file
r.export("out.csv")        # -> single CSV
r.export("out.fits")       # -> single FITS binary table

export(path) infers the format from the extension (.parquet / .pq → parquet, .csv → csv, .fits / .fit → fits); pass format= explicitly to override. A missing or unrecognized extension raises ValidationErrorexport never writes HATS.

HATS output stays in the system — write it from a Catalog

There is no Result.save. A partitioned HATS catalog is written by the lazy, still-in-the-system handle, which streams it correctly (any size) and keeps its _healpix_29 spatial index: use Catalog.save(path, name="...") for the registered EDA pattern, or acid.sql.query(query, output="dir/") / acid query --output dir/ for the SQL surface. See Catalogs and the registry.

Quick sanity checks

You want Call
Row count r.num_rows or len(r)
Column names r.column_names
Schema (with types) r.schema
A single column (no copy) r.column("ra")
First 10 rows printed r.head(10).show()
First 10 rows as DataFrame r.head(10).to_pandas()