Debug small, run big

You wrote a crossmatch (or a filter, or an aggregate) and want to confirm it does the right thing before you spend an hour of cluster time grinding through the whole sky. The single most useful technique for that — the one experienced ACID users reach for without thinking — is to develop the query against a small slice of the sky, then drop the slice for the production run.

This page is the recipe. It is short because the underlying rule is short: composing verbs is free; only materialization actually reads data.

The rule

Catalog methods come in two kinds.

  • Composition verbs return a new Catalog. They do not read any parquet, do not start workers, do not send anything to the query engine. They record what you want to do, in memory, on the Connection process.
  • Materialization verbs run the recorded query end-to-end. They enumerate partitions, launch (or reuse) the worker pool, read parquet, run the matcher / reducer, and hand you back a Result, a DataFrame, an astropy.table.Table, or a HATS catalog on disk.

That split is what makes the small-then-big pattern work: composing a debug slice on top of a query is free (no extra I/O), and removing the slice for the production run is a one-line change.
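The split can be illustrated with a toy model. This is a minimal sketch, not ACID's implementation: `ToyCatalog`, `read_partition`, and the in-memory `PARTITIONS` list are hypothetical stand-ins, and the only point is that composing changes a description while executing is what reads data.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple

Row = dict

# Hypothetical stand-in for on-disk partitions; every read is counted.
PARTITIONS = [
    [{"id": 1, "mag": 17.2}, {"id": 2, "mag": 19.5}],
    [{"id": 3, "mag": 16.8}, {"id": 4, "mag": 21.0}],
]
reads = {"count": 0}


def read_partition(index: int) -> list:
    """Pretend to read one partition from disk and count the access."""
    reads["count"] += 1
    return PARTITIONS[index]


@dataclass(frozen=True)
class ToyCatalog:
    """Immutable query description; holding one touches no data."""
    predicates: Tuple[Callable[[Row], bool], ...] = ()
    columns: Optional[Tuple[str, ...]] = None

    # Composition verbs: return a new description, perform no I/O.
    def where(self, predicate):
        return replace(self, predicates=self.predicates + (predicate,))

    def select(self, *columns):
        return replace(self, columns=columns)

    # Materialization verb: the only place partitions are read.
    def execute(self):
        out = []
        for index in range(len(PARTITIONS)):
            for row in read_partition(index):
                if all(p(row) for p in self.predicates):
                    keep = self.columns if self.columns is not None else tuple(row)
                    out.append({c: row[c] for c in keep})
        return out


base = ToyCatalog()
bright = base.where(lambda r: r["mag"] < 18).select("id")
reads_after_compose = reads["count"]   # composing read nothing
result = bright.execute()
reads_after_execute = reads["count"]   # one read per partition
```

In this toy, `base` is unchanged after `bright` is built, which is the same property that makes branching from one handle safe.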

The two kinds, side by side

Composition (lazy, returns Catalog)      | Materialization (runs the query)
-----------------------------------------+------------------------------------------------------------
where(predicate)                         | head(n): eager LIMIT n, returns Result
select(*cols)                            | execute(): full run, returns Result
limit(n)                                 | to_pandas() / to_polars() / to_arrow() / to_astropy()
in_region(name_or_moc)                   | save(path, name=...): write a HATS catalog and re-register
crossmatch(other, radius=..., ...)       | acid.sql.query(...): runs immediately and returns Result
join(other, on=..., ...)                 |
group_by(*keys) / aggregate(**named)     |
where(...) after aggregate / sort(*keys) |

Plus one scoping context manager you'll meet below: with acid.in_cone(center, radius=...): restricts every query executed inside the block to a sky cone. It is not a materialization verb — nothing runs until you call one of the right-hand ones — but it changes the result of every materialization inside it.

Why this is free

A composition chain like

(gaia.where("phot_g_mean_mag < 18")
     .crossmatch(twomass, radius=1 * u.arcsec, dist_col="d")
     .where("d < 0.5")
     .select("source_id, designation, d")
     .limit(1000))

is just five small dataclass updates. No file is touched. Composing more can even reduce the work the query engine does at the eventual .execute(): .select(...) narrows the columns the parquet reader pulls; .where(...) predicates push down into per-partition filters and row-group statistics checks; .limit(n) propagates to each partition as a phase-1 LIMIT. None of that fires until you call a materialization verb.

That also means wrapping an otherwise expensive query in a debug slice costs nothing up front. Composing a billion-row crossmatch does not run it; the crossmatch runs only when you materialize, at whatever scope is in effect at that moment.
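To make the "phase-1 LIMIT" idea concrete, here is a hedged sketch of a two-phase limit over plain Python lists. The function names and the existence of a second, merging phase are assumptions made for illustration; the engine's actual mechanics are not described on this page.

```python
from itertools import islice


def scan_partition(rows, predicate, limit):
    """Phase 1: filter one partition, stopping after `limit` surviving rows."""
    return list(islice((row for row in rows if predicate(row)), limit))


def limited_query(partitions, predicate, limit):
    """Phase 2 (assumed): merge the capped partition results, truncate once more."""
    shipped = [scan_partition(rows, predicate, limit) for rows in partitions]
    merged = [row for part in shipped for row in part]
    return merged[:limit], sum(len(part) for part in shipped)


# Three hypothetical partitions of 100 rows each; half the rows pass the filter.
partitions = [list(range(0, 100)), list(range(100, 200)), list(range(200, 300))]
rows, rows_shipped = limited_query(partitions, lambda x: x % 2 == 0, limit=5)
```

Without the per-partition cap, 150 surviving rows would be handed to the merge step; with it, at most 5 per partition are.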

The two "small slice" verbs

In practice, debugging a query against a small slice means one of two things.

head(n) — eager LIMIT, returns a Result

Catalog.head(n) is the same as .limit(n).execute() — it runs the query, with a partition-pushed LIMIT, and gives you back the first n rows. Use it for shape checks: does my projection have the right columns, are the numbers in the right ballpark, did my filter exclude all rows by mistake?

head_check.py
import acid
import astropy.units as u

acid.init("catalogs.yaml", workers=8)   # optional — first acid.open() lazy-inits

gaia = acid.open("gaia_dr3")
tm   = acid.open("twomass_psc")

# 10 rows, runs in seconds: enough to eyeball columns / types / NULLs.
r = (gaia
     .where("phot_g_mean_mag < 18")
     .crossmatch(tm, radius=1 * u.arcsec, dist_col="d")
     .select("source_id, designation, d")
     .head(10))

r.show()

head(n) is fast but not always cheap

head(n) does not filter by sky region. It walks partitions until it has collected n matching rows, then stops. When matching rows are common, that is fast; when the first survivor sits on the other side of the sky from the partition the worker happened to start on, it can still touch a lot of data. For "small in area, not just small in rows", reach for in_cone.
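A small simulation shows why the cost of a row-limited scan depends on where the matches are. This is an illustrative sketch over in-memory lists; the `head` function below is a hypothetical stand-in, not Catalog.head, and the partition order is an assumption.

```python
def head(partitions, predicate, n):
    """Walk partitions in order until n rows pass; also report partitions touched."""
    out, touched = [], 0
    for rows in partitions:
        touched += 1
        for row in rows:
            if predicate(row):
                out.append(row)
                if len(out) == n:
                    return out, touched
    return out, touched


# 100 hypothetical partitions of 10 magnitudes each; only the last holds bright sources.
partitions = [[20.0] * 10 for _ in range(99)] + [[15.0] * 10]

# Matches everywhere: the first partition already yields 5 rows.
_, touched_common = head(partitions, lambda mag: mag > 19, n=5)

# Matches only in the last partition: every partition is scanned first.
_, touched_rare = head(partitions, lambda mag: mag < 16, n=5)
```

Both calls return five rows, but the second one touches every partition to find them.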

in_cone(center, radius=...) — sky region at execution time

acid.in_cone(...) is a context manager. Every query executed inside the block — fluent or SQL — is restricted to a circular sky region. Partitions that don't overlap the cone are pruned at enumeration time, before any worker is dispatched, so a 1° debug cone on a full-sky catalog reads a few partitions instead of thousands.
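The pruning step can be sketched as an overlap test between the cone and each partition. This is a simplified illustration: real HATS partitions are HEALPix pixels, whereas the sketch approximates each partition by a hypothetical bounding circle (center plus radius), and the `prune` helper is not part of any library.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form, stable at small angles)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def prune(partitions, cone_center, cone_radius_deg):
    """Keep partitions whose bounding circle can overlap the cone."""
    ra0, dec0 = cone_center
    return [p["name"] for p in partitions
            if angular_separation_deg(ra0, dec0, p["ra"], p["dec"])
            <= cone_radius_deg + p["radius"]]


# Hypothetical partitions, each approximated by a 2-degree bounding circle.
partitions = [
    {"name": "a", "ra": 180.5, "dec": 0.5, "radius": 2.0},
    {"name": "b", "ra": 359.5, "dec": 0.0, "radius": 2.0},
    {"name": "c", "ra": 90.0, "dec": 45.0, "radius": 2.0},
]

kept = prune(partitions, (180.0, 0.0), 1.0)   # cone near partition "a"
wrap = prune(partitions, (0.5, 0.0), 1.0)     # cone across the RA = 0 wrap
```

Because the separation is computed on the sphere, a cone at RA 0.5° still finds the partition centered at RA 359.5°.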

in_cone_debug.py
import acid
import astropy.units as u

acid.init("catalogs.yaml", workers=8)

gaia = acid.open("gaia_dr3")
tm   = acid.open("twomass_psc")

matched = (gaia
           .where("phot_g_mean_mag < 18")
           .crossmatch(tm, radius=1 * u.arcsec, dist_col="d")
           .select("source_id, designation, d"))

# Develop against a 1° cone around (180°, 0°). Same query object; the
# cone applies because the *materialization* happens inside the block.
with acid.in_cone((180.0, 0.0), radius=1 * u.deg):
    df = matched.to_polars()           # full result *inside the cone*
    print(df.shape)
    print(df["d"].describe())          # eyeball the separation distribution

The cone is an execution-time scope, not a property of the Catalog:

  • It applies at materialization, not construction. The cone in effect when you call .to_polars() / .execute() / .head() / .save() is the one that scopes that run. The same query object runs scoped inside the block and full-sky outside it — there is nothing to "remember" and no stale-handle error.
  • One cone at a time. acid.in_cone(...) blocks do not nest; entering a second while one is active raises ValidationError.
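Both properties can be modeled with a context variable that is read at execution time. This is a minimal sketch of the behavior described above, not ACID's code: `in_cone`, `ToyQuery`, and the local `ValidationError` class are stand-ins defined here for illustration.

```python
from contextlib import contextmanager
from contextvars import ContextVar


class ValidationError(Exception):
    """Local stand-in for the error the page says nested cones raise."""


_active_cone = ContextVar("active_cone", default=None)


@contextmanager
def in_cone(center, radius_deg):
    """Set the active cone for the duration of the block; refuse to nest."""
    if _active_cone.get() is not None:
        raise ValidationError("in_cone blocks do not nest")
    token = _active_cone.set((center, radius_deg))
    try:
        yield
    finally:
        _active_cone.reset(token)


class ToyQuery:
    """The query stores no cone; it looks one up only when executed."""

    def execute(self):
        cone = _active_cone.get()
        return "full sky" if cone is None else f"cone {cone}"


query = ToyQuery()
with in_cone((180.0, 0.0), 1.0):
    scoped = query.execute()       # runs inside the cone
unscoped = query.execute()         # same object, no cone in effect

try:
    with in_cone((180.0, 0.0), 1.0):
        with in_cone((10.0, 10.0), 1.0):
            pass
    nested_error = None
except ValidationError as exc:
    nested_error = str(exc)
after_error = query.execute()      # the outer cone was cleaned up
```

Keeping the scope outside the query object is what lets one handle serve both the debug run and the production run.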

Pick a cone with sources in it

If your debug cone returns zero rows, you can't tell whether the query is correct. Pick a center where you know the catalog has coverage — the Galactic plane for a Galactic survey, the DES footprint for a southern dataset, a known science field for Rubin. Verify with a one-line head(10) on a single catalog before composing the full query.

The pattern — develop in a cone, run on the sky

The canonical workflow:

debug_small_run_big.py
import acid
import astropy.units as u
from acid import agg

acid.init("catalogs.yaml", workers=8)

# The query, written once. No cone in here.
gaia = acid.open("gaia_dr3")
tm   = acid.open("twomass_psc")
matched = (gaia
           .where("phot_g_mean_mag < 18")
           .crossmatch(tm, radius=1 * u.arcsec, dist_col="d")
           .group_by("ruwe < 1.4")
           .aggregate(n=agg.count(), mean_d=agg.mean("d"))
           .where("n > 0"))

# 1. Develop in a 1° cone — runs in seconds.
with acid.in_cone((180.0, 0.0), radius=1 * u.deg):
    df = matched.to_polars()
    print(df)                       # eyeball, iterate, re-run

# 2. Production run — the same query object, no cone.
matched.save("matches_full_sky")   # streams; or .export("m.parquet") for one file

Two things this pattern buys you:

  1. The query is the same in both phases. No LIMIT to remove, no WHERE clause to delete, no different code paths, and no need to rebuild the handle — the same matched object runs scoped inside the cone and full-sky outside it. The debug harness is the cone, not the query.
  2. The slowest thing you'll ever debug is the cone. A 1° cone on a full-sky HATS catalog typically reads a single-digit number of partitions. Even a complex crossmatch+aggregate inside that cone runs in seconds at most — fast enough to iterate while you read the printed df.

When the in-cone numbers look right, just materialize outside the with block for the full-sky run. The shift from "debug" to "production" is one line of indentation, not a code rewrite.

Other lazy-vs-eager moments worth knowing

A few smaller properties that follow from the same composition / materialization split:

  • Catalog is immutable. Composition verbs return a new handle; the old one is unchanged. Branching is free — gaia_bright = gaia.where("g < 18") and gaia_faint = gaia.where("g > 20") are two independent handles over the same connection.
  • .columns, .alias, describe(), explain() are cheap and do no I/O beyond cached metadata. describe() reads the per-partition row counts that are already in HATS metadata; it does not scan the rows. explain() returns the analyzed plan as a string for inspection. Use them when you're not sure what you're about to materialize.
  • Result is materialized, but conversions are cheap. r.to_pandas(), r.to_polars(), r.to_arrow() all work over the same in-memory pa.Table (or stream from disk if the result was spilled). Converting between them does not re-run the query. See Working with results for the converter table.
  • Catalog.save(path, name=...) is the one composition-shaped verb that does materialize. It writes a HATS catalog to disk and re-registers it on the connection, then returns a fresh Catalog pointing at the saved tree. Use it when the same intermediate feeds many downstream queries — pay for the materialization once, reuse for free.

When head(n) is enough — and when it isn't

A rule of thumb:

You want to check                                       | Reach for
--------------------------------------------------------+--------------------------------------------
Schema / projection / column types                      | head(10)
A non-empty result exists at all                        | head(10)
Filter selectivity ("how many rows pass?")              | cone + .aggregate(n=agg.count())
Crossmatch quality (separation distribution, NULLs)     | cone + .to_polars()
Aggregate output shape (group keys, NULLs in keys)      | cone + .to_polars()
Distribution-shape questions ("is my color cut right?") | cone + .to_astropy() + your usual plotting

head(n) is cheaper to type and faster when matching rows are common, but it gives you a sample, not a slice. Anything where the distribution matters (a separation histogram, a magnitude cut, an aggregate breakdown) wants a cone.
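The difference between a sample and a slice can be seen with synthetic numbers. This is an illustrative sketch with made-up separation values and an assumed partition order; it does not reflect any real catalog.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical cone covering three partitions whose separation
# distributions (in arcsec) differ from one another.
cone_partitions = [
    [rng.uniform(0.0, 0.2) for _ in range(500)],   # tight matches
    [rng.uniform(0.0, 1.0) for _ in range(500)],   # broad spread
    [rng.uniform(0.5, 1.0) for _ in range(500)],   # loose matches
]

# A head-style sample: the first 10 rows, all from the first partition.
head_sample = cone_partitions[0][:10]

# A cone-style slice: every row in the region.
cone_slice = [d for part in cone_partitions for d in part]

head_mean = statistics.mean(head_sample)
slice_mean = statistics.mean(cone_slice)
```

The first ten rows describe only the partition they came from, so a histogram built from them would miss the loose matches entirely.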

See also

  • Crossmatching catalogs: the matcher verb most often debugged with this pattern, plus the four astronomy-correctness checks (J2000 epoch, radius vs. margin cache, RA-wrap / poles, output units).
  • Aggregating: the other shape that composes cheaply (group_by / aggregate / post-aggregate where / sort are all lazy) and is best developed inside a cone before being unleashed.
  • Working with results: once the query is right and you have a Result, this is where the converter table and the I/O behavior live.
  • Performance & parallelism: once the query is right and you want the full-sky run to go faster (workers, threads, memory).