Debug small, run big¶
You wrote a crossmatch (or a filter, or an aggregate) and want to confirm it does the right thing before you spend an hour of cluster time grinding through the whole sky. The single most useful technique for that — the one experienced ACID users reach for without thinking — is to develop the query against a small slice of the sky, then drop the slice for the production run.
This page is the recipe. It is short because the underlying rule is short: composing verbs is free; only materialization actually reads data.
The rule¶
Catalog methods come in two kinds.
- Composition verbs return a new Catalog. They do not read any parquet, do not start workers, and do not send anything to the query engine. They record what you want to do, in memory, on the Connection process.
- Materialization verbs run the recorded query end-to-end. They enumerate partitions, launch (or reuse) the worker pool, read parquet, run the matcher / reducer, and hand you back a Result, a DataFrame, an astropy.table.Table, or a HATS catalog on disk.
That split is what makes the small-then-big pattern work: composing a debug slice on top of a query is free (no extra I/O), and removing the slice for the production run is a one-line change.
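The split can be pictured with a toy model. The sketch below is not ACID's implementation: `LazyQuery`, `read_rows`, and the `READS` counter are hypothetical stand-ins, written only to show composition recording steps on an immutable handle while a single method does the reading.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple

READS = 0  # counts how many times the toy "storage" is actually read


def read_rows():
    """Stand-in for reading parquet; bumps the global read counter."""
    global READS
    READS += 1
    return [{"id": i, "mag": 10 + i} for i in range(20)]


@dataclass(frozen=True)
class LazyQuery:
    """Toy lazy handle: composition records steps, execute() runs them."""
    predicates: Tuple[Callable[[dict], bool], ...] = ()
    row_limit: Optional[int] = None

    def where(self, predicate):
        # Composition: return a new handle, touch no data.
        return replace(self, predicates=self.predicates + (predicate,))

    def limit(self, n):
        # Composition: record the limit only.
        return replace(self, row_limit=n)

    def execute(self):
        # Materialization: the only place data is read.
        rows = [r for r in read_rows() if all(p(r) for p in self.predicates)]
        return rows if self.row_limit is None else rows[: self.row_limit]


base = LazyQuery()
bright = base.where(lambda r: r["mag"] < 18)  # new handle, base unchanged
debug = bright.limit(3)                       # debug slice on top
reads_after_compose = READS                   # still 0: nothing has run
debug_rows = debug.execute()                  # first read happens here
full_rows = bright.execute()                  # drop the slice for the full run
```

Dropping the slice means calling `execute()` on `bright` instead of `debug`; the underlying query is untouched.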
The two kinds, side by side¶
| Composition (lazy, returns Catalog) | Materialization (runs the query) |
|---|---|
| where(predicate) | head(n): eager LIMIT n, returns Result |
| select(*cols) | execute(): full run, returns Result |
| limit(n) | to_pandas() / to_polars() / to_arrow() / to_astropy() |
| in_region(name_or_moc) | save(path, name=...): writes a HATS catalog and re-registers it |
| crossmatch(other, radius=..., ...) | acid.sql.query(...): runs immediately and returns Result |
| join(other, on=..., ...) | |
| group_by(*keys) / aggregate(**named) | |
| where(...) after aggregate / sort(*keys) | |
Plus one scoping context manager you'll meet below:
with acid.in_cone(center, radius=...): restricts every query
executed inside the block to a sky cone. It is not a materialization
verb — nothing runs until you call one of the right-hand ones — but it
changes the result of every materialization inside it.
Why this is free¶
A composition chain like
(gaia.where("phot_g_mean_mag < 18")
     .crossmatch(twomass, radius=1 * u.arcsec, dist_col="d")
     .where("d < 0.5")
     .select("source_id, designation, d")
     .limit(1000))
is just five small dataclass updates. No file is touched. The more
you compose, the less work the query engine does at the eventual
.execute(): .select(...) narrows the columns the parquet
reader pulls; .where(...) predicates push into per-partition filters
and row-group statistics; .limit(n) propagates to each partition
as a phase-1 LIMIT. None of that fires until you call a materialization
verb.
That also means composing a debug slice on top of an otherwise expensive query costs nothing up front. A billion-row crossmatch wrapped in a cone does not run at all until you materialize it, and when it does run it reads only the partitions the cone overlaps.
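To make the per-partition LIMIT concrete, here is a minimal sketch over plain Python lists. The function names are hypothetical, and the final merge-and-truncate step is an assumption; the page only names the phase-1 LIMIT.

```python
from itertools import islice


def scan_partition(rows, predicate, n):
    """Phase 1 (per partition): stop after n matching rows."""
    return list(islice((r for r in rows if predicate(r)), n))


def limited_query(partitions, predicate, n):
    """Apply a per-partition LIMIT, then a final LIMIT over the merged rows."""
    merged = []
    for rows in partitions:
        merged.extend(scan_partition(rows, predicate, n))
    return merged[:n]  # assumed final step; the page only names phase 1


# Three toy partitions of 100 integers each.
partitions = [list(range(0, 100)), list(range(100, 200)), list(range(200, 300))]
result = limited_query(partitions, lambda x: x % 2 == 0, 5)
per_partition = scan_partition(partitions[1], lambda x: x % 2 == 0, 5)
```

Each partition hands back at most `n` rows, so the merge step never sees more than `n` rows per partition regardless of partition size.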
The two "small slice" verbs¶
In practice, debugging a query against a small slice means one of two things.
head(n) — eager LIMIT, returns a Result¶
Catalog.head(n) is the same as .limit(n).execute() — it runs the
query, with a partition-pushed LIMIT, and gives you back the first
n rows. Use it for shape checks: does my projection have the right
columns, are the numbers in the right ballpark, did my filter exclude
all rows by mistake?
import acid
import astropy.units as u
acid.init("catalogs.yaml", workers=8) # optional — first acid.open() lazy-inits
gaia = acid.open("gaia_dr3")
tm = acid.open("twomass_psc")
# 10 rows, runs in seconds: enough to eyeball columns / types / NULLs.
r = (gaia
    .where("phot_g_mean_mag < 18")
    .crossmatch(tm, radius=1 * u.arcsec, dist_col="d")
    .select("source_id, designation, d")
    .head(10))
r.show()
head(n) is fast but not always cheap
head(n) does not filter by sky region. It walks partitions
until it has collected n matching rows, then stops. On a
selective query that's fast; on a query whose first survivor is on
the other side of the sky from the partition the worker happened to
start on, it can still touch a lot of data. For "small in area,
not just small in rows", reach for in_cone.
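The cost difference is easy to see in a toy model of the partition walk. This is a sketch under the assumption that partitions are visited in a fixed order; `head_walk` is a hypothetical helper, not an ACID function.

```python
def head_walk(partitions, predicate, n):
    """Walk partitions in order until n matching rows are collected.

    Returns (rows, partitions_touched). Toy model only.
    """
    rows, touched = [], 0
    for part in partitions:
        touched += 1
        for row in part:
            if predicate(row):
                rows.append(row)
                if len(rows) == n:
                    return rows, touched
    return rows, touched


# 50 partitions of 100 rows each; row values are 0..4999.
partitions = [list(range(p * 100, (p + 1) * 100)) for p in range(50)]

# Matches are common everywhere: the walk stops in the first partition.
_, touched_common = head_walk(partitions, lambda x: x % 7 == 0, 10)

# The first survivors live in the last partition: the walk visits all 50.
_, touched_rare = head_walk(partitions, lambda x: x >= 4900, 10)
```

Both calls return ten rows, but the second one reads fifty times as many partitions to find them.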
in_cone(center, radius=...) — sky region at execution time¶
acid.in_cone(...) is a context manager. Every query executed inside
the block — fluent or SQL — is restricted to a circular sky region.
Partitions that don't overlap the cone are pruned at enumeration time,
before any worker is dispatched, so a 1° debug cone on a full-sky catalog
reads a few partitions instead of thousands.
import acid
import astropy.units as u
acid.init("catalogs.yaml", workers=8)
gaia = acid.open("gaia_dr3")
tm = acid.open("twomass_psc")
matched = (gaia
    .where("phot_g_mean_mag < 18")
    .crossmatch(tm, radius=1 * u.arcsec, dist_col="d")
    .select("source_id, designation, d"))
# Develop against a 1° cone around (180°, 0°). Same query object; the
# cone applies because the *materialization* happens inside the block.
with acid.in_cone((180.0, 0.0), radius=1 * u.deg):
    df = matched.to_polars() # full result *inside the cone*
    print(df.shape)
    print(df["d"].describe()) # eyeball the separation distribution
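Pruning at enumeration time can be illustrated with a simplified geometric test. The sketch below assumes each partition is described by a center and a bounding radius, which is a stand-in for the real HATS pixel geometry; `prune_partitions` and the partition records are hypothetical.

```python
import math


def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def prune_partitions(partitions, center, radius_deg):
    """Keep partitions whose bounding circle can overlap the cone."""
    ra0, dec0 = center
    return [
        p["name"] for p in partitions
        if angular_sep_deg(ra0, dec0, p["ra"], p["dec"]) <= radius_deg + p["r"]
    ]


# Hypothetical partitions: center (ra, dec) and bounding radius r, in degrees.
partitions = [
    {"name": "p0", "ra": 180.5, "dec": 0.5, "r": 2.0},
    {"name": "p1", "ra": 182.5, "dec": 0.0, "r": 2.0},
    {"name": "p2", "ra": 0.0, "dec": 0.0, "r": 2.0},
    {"name": "p3", "ra": 180.0, "dec": 60.0, "r": 2.0},
]
kept = prune_partitions(partitions, center=(180.0, 0.0), radius_deg=1.0)
```

Only the partitions near (180°, 0°) survive the test, so no worker is ever handed the other two.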
The cone is an execution-time scope, not a property of the Catalog:
- It applies at materialization, not construction. The cone in effect when you call .to_polars() / .execute() / .head() / .save() is the one that scopes that run. The same query object runs scoped inside the block and full-sky outside it; there is nothing to "remember" and no stale-handle error.
- One cone at a time. acid.in_cone(...) blocks do not nest; entering a second while one is active raises ValidationError.
Pick a cone with sources in it
If your debug cone returns zero rows, you can't tell whether the
query is correct. Pick a center where you know the catalog has
coverage — the Galactic plane for a Galactic survey, the DES
footprint for a southern dataset, a known science field for
Rubin. Verify with a one-line head(10) on a single catalog
before composing the full query.
The pattern — develop in a cone, run on the sky¶
The canonical workflow:
import acid
import astropy.units as u
from acid import agg
acid.init("catalogs.yaml", workers=8)
# The query, written once. No cone in here.
gaia = acid.open("gaia_dr3")
tm = acid.open("twomass_psc")
matched = (gaia
    .where("phot_g_mean_mag < 18")
    .crossmatch(tm, radius=1 * u.arcsec, dist_col="d")
    .group_by("ruwe < 1.4")
    .aggregate(n=agg.count(), mean_d=agg.mean("d"))
    .where("n > 0"))
# 1. Develop in a 1° cone; runs in seconds.
with acid.in_cone((180.0, 0.0), radius=1 * u.deg):
    df = matched.to_polars()
    print(df) # eyeball, iterate, re-run
# 2. Production run — the same query object, no cone.
matched.save("matches_full_sky") # streams; or .export("m.parquet") for one file
Two things this pattern buys you:
- The query is the same in both phases. No LIMIT to remove, no WHERE clause to delete, no different code paths, and no need to rebuild the handle: the same matched object runs scoped inside the cone and full-sky outside it. The debug harness is the cone, not the query.
- The slowest thing you'll ever debug is the cone. A 1° cone on a full-sky HATS catalog typically reads a single-digit number of partitions. Even a complex crossmatch+aggregate inside that cone runs in seconds at most, fast enough to iterate while you read the printed df.
When the in-cone numbers look right, just materialize outside the with
block for the full-sky run. The shift from "debug" to "production" is one
line of indentation, not a code rewrite.
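If you would rather flip a flag than re-indent, the standard library's `contextlib.nullcontext` lets one `with` statement serve both phases. This is a sketch of that idea only: `debug_scope` is a hypothetical helper, and `fake_cone` stands in for the real cone so the snippet runs without a catalog.

```python
from contextlib import contextmanager, nullcontext


def debug_scope(enabled, make_scope):
    """Return make_scope() when debugging, otherwise a no-op context."""
    return make_scope() if enabled else nullcontext()


# Stand-in scope so the sketch runs without a catalog; it logs enter/exit.
events = []


@contextmanager
def fake_cone():
    events.append("enter")
    try:
        yield
    finally:
        events.append("exit")


def run(debug):
    with debug_scope(debug, fake_cone):
        events.append("materialize")  # e.g. matched.to_polars() in real code


run(debug=True)   # scoped run: enter, materialize, exit
run(debug=False)  # production run: materialize only
```

In real code the factory would be something like `lambda: acid.in_cone((180.0, 0.0), radius=1 * u.deg)`, using the same call shown in the examples on this page.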
Other lazy-vs-eager moments worth knowing¶
A few smaller properties that follow from the same composition / materialization split:
- Catalog is immutable. Composition verbs return a new handle; the old one is unchanged. Branching is free: gaia_bright = gaia.where("g < 18") and gaia_faint = gaia.where("g > 20") are two independent handles over the same connection.
- .columns, .alias, describe(), explain() are cheap and do no I/O beyond cached metadata. describe() reads the per-partition row counts that are already in HATS metadata; it does not scan the rows. explain() returns the analyzed plan as a string for inspection. Use them when you're not sure what you're about to materialize.
- Result is materialized, but conversions are cheap. r.to_pandas(), r.to_polars(), r.to_arrow() all work over the same in-memory pa.Table (or stream from disk if the result was spilled). Converting between them does not re-run the query. See Working with results for the converter table.
- Catalog.save(path, name=...) is the one composition-shaped verb that does materialize. It writes a HATS catalog to disk and re-registers it on the connection, then returns a fresh Catalog pointing at the saved tree. Use it when the same intermediate feeds many downstream queries: pay for the materialization once, reuse for free.
When head(n) is enough — and when it isn't¶
A rule of thumb:
| You want to check | Reach for |
|---|---|
| Schema / projection / column types | head(10) |
| A non-empty result exists at all | head(10) |
| Filter selectivity — "how many rows pass?" | cone + .aggregate(n=agg.count()) |
| Crossmatch quality — separation distribution, NULLs | cone + .to_polars() |
| Aggregate output shape — group keys, NULLs in keys | cone + .to_polars() |
| Distribution-shape questions ("is my color cut right?") | cone + .to_astropy() + your usual plotting |
head(n) is cheaper to type and faster on a selective query, but it
gives you a sample, not a slice. Anything where the distribution
matters — a separation histogram, a magnitude cut, an aggregate
breakdown — wants a cone.
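A small numeric illustration of sample versus slice, using made-up separations rather than real catalog data. It assumes the rows happen to arrive ordered by separation, which is an exaggeration chosen to make the bias obvious.

```python
from statistics import mean

# Hypothetical separations (arcsec) for every match inside one small region,
# in the order a worker happens to read them.
region_d = [round(0.01 * i, 2) for i in range(100)]  # 0.00 .. 0.99

head_rows = region_d[:10]   # what a head(10)-style sample sees
slice_rows = region_d       # what a region-complete slice sees

head_mean = mean(head_rows)
slice_mean = mean(slice_rows)

# Fraction of matches that would survive a "d < 0.5" cut in each view.
frac_head = sum(d < 0.5 for d in head_rows) / len(head_rows)
frac_slice = sum(d < 0.5 for d in slice_rows) / len(slice_rows)
```

The first ten rows suggest every match passes the cut; the complete slice shows only half do.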
See also¶
- Crossmatching catalogs: the matcher verb most often debugged with this pattern, plus the four astronomy-correctness checks (J2000 epoch, radius vs. margin cache, RA-wrap / poles, output units).
- Aggregating: the other shape that composes cheaply (group_by / aggregate / post-aggregate where / sort are all lazy) and is best developed inside a cone before being unleashed.
- Working with results: once the query is right and you have a Result, this is where the converter table and the I/O behavior live.
- Performance & parallelism: once the query is right and you want to make the full-sky run go faster: workers, threads, memory.