Writing SQL (the escape hatch)¶
The fluent Catalog verbs cover most of what astronomers ask of acid:
where, select, crossmatch, join, in_region, group_by /
aggregate (a post-aggregate where is the HAVING role),
collect_lists, with_columns, sort / limit, save. Each of those
has its own task-shaped page:
- Crossmatching catalogs
- Filtering rows
- Aggregating
- Sky regions & footprints
- Light curves for a list of targets
This page covers acid.sql.query(...) — the SQL escape hatch — which is
what you reach for when:
- you want one of the SQL-only shapes (a window function, an inline
subquery, a CTE pre-filter, a non-decomposable aggregate that you
know
acidwill reject so you can fall back to a different query shape); - you already have a SQL string from somewhere else (a notebook cookbook, a colleague's pipeline) and don't want to translate it;
- you're more comfortable in SQL than in a fluent Python chain.
acid.sql.query(query) always returns a Result; the dialect, the
extensions (XMATCH, IN_MOC), and the rejection rules are all
documented below.
The SQL surface¶
acid's SQL is the standard SELECT / FROM / JOIN / WHERE / GROUP BY /
HAVING / ORDER BY / LIMIT shape every astronomer has seen, plus two
astronomy-specific extensions:
XMATCH(radius_arcsec => …)inJOIN ... ONclauses for spherical crossmatching — the full reference for which is in the crossmatch guide. Thedist_col => '<name>'named argument surfaces the match distance (arcseconds) as a named column.IN_MOC(<alias>, '<name>')as a per-row predicate restricting rows to a MOC footprint — full reference in sky regions & footprints.
The dialect is parsed by SQLGlot and the accepted subset is what ACID's analyzer accepts. The rest of this page covers the bits you won't find on the fluent surface.
Inline subqueries and CTEs in FROM / JOIN-RHS¶
ACID accepts inline subqueries (and CTE references) in three
positions — the anchor FROM, an XMATCH JOIN's right-hand side,
and an ordinary JOIN's right-hand side — restricted to one shape:
The WHERE clause is folded into ACID's per-partition pre-filter
plumbing. For XMATCH with mode => 'nearest' this gives "nearest
surviving match" semantics — the matcher only sees passing rows,
so the closest is the closest of the survivors:
SELECT a.id, b.id AS b_id
FROM (SELECT * FROM gaia_dr3 WHERE phot_g_mean_mag < 16) AS a
JOIN (SELECT * FROM twomass_psc WHERE j_m < 14) AS b
ON XMATCH(radius_arcsec => 1.0)
A CTE form is equivalent (and often more readable):
WITH a AS (SELECT * FROM gaia_dr3 WHERE phot_g_mean_mag < 16),
b AS (SELECT * FROM twomass_psc WHERE j_m < 14)
SELECT a.id, b.id AS b_id
FROM a JOIN b ON XMATCH(radius_arcsec => 1.0)
Restrictions on the inner SELECT¶
The inner SELECT shape is restricted to keep the pre-filter cheap and correct. Reject-with-error for:
- a projection narrower than
*(SELECT id, ra, dec FROM gaia— no); JOINs,GROUP BY/ aggregates,DISTINCT,ORDER BY,LIMIT, sub-subqueries inside the body;- CTE-to-CTE references (the CTE body's
FROMmust name a registered catalog).
The inner WHERE is also restricted — comparisons, logical operators,
IS NULL, BETWEEN, IN (...) of literals, simple arithmetic. No
function calls, CAST, LIKE, or subqueries inside the inner WHERE.
This is narrower than the outer WHERE allowlist because the inner
predicate is lowered through pl.sql_expr for parquet pushdown; the
outer one runs after the data is in memory.
If you need a wider inner-SELECT shape — say, a JOIN or a
narrowed projection inside the subquery — materialize the
intermediate with Catalog.save(...), register it on the connection,
and reference it by name in the follow-up query. The save round-trip
is the supported path for "a SQL pre-step the dialect doesn't
support".
Raw files in SQL queries¶
acid.open(<file>) returns a fluent Catalog but does not register
a name, so acid.sql.query("... FROM <file> ...") can't see it (an ad-hoc
file can't shadow a configured catalog). To reference a raw file by
name in SQL, register it explicitly with register_file(name, path,
ra=…, dec=…) — acid.register_file(...) on the default connection, or
the same-named method on an explicit Connection:
import acid
acid.init("/data/hats")
acid.register_file("targets", "candidates.csv", ra="RA", dec="DEC")
r = acid.sql.query("""
SELECT t.id, g.source_id
FROM targets t
JOIN gaia_dr3 g ON XMATCH(radius_arcsec => 1.0)
""")
tbl = r.to_astropy()
register_file spills the file once to a virtual catalog and puts
name in the registry, so both acid.sql.query("... FROM <name> ...")
and acid.open(<name>) resolve it. ra/dec are required (never
guessed).
If you don't need the named SQL surface, the fluent path is simpler
still — acid.open("candidates.csv", ra="RA", dec="DEC") gives you a
Catalog you can .crossmatch(...) straight away (no registration).
On the CLI the same thing is acid query --open:
acid query "SELECT t.id, g.source_id FROM t JOIN gaia_dr3 ON XMATCH(radius_arcsec => 1.0)" \
--db /data/hats --open t=candidates.csv,ra=RA,dec=DEC
See bring your own target list
for the fluent counterpart (acid.open(<file>, ra=…, dec=…)), which is
the simpler path when you don't need the SQL surface.
Attaching an ID-keyed lookup table¶
When you have a small, position-less lookup table — an id→label
map, a classification, a per-object weight — Catalog.join(<frame>,
on=…) joins it in directly. The right operand can be a polars,
pandas, NumPy-structured, pyarrow, or Astropy frame (not just another
Catalog):
import acid
import polars as pl
acid.init("/data/hats", workers=8)
labels = pl.DataFrame({"source_id": [...], "var_class": [...]})
classified = (acid.open("gaia_dr3")
.where("phot_g_mean_mag < 18")
.join(labels, on="source_id")) # attach var_class by id
The frame is broadcast — spilled once to a small Arrow file, read
whole into every worker, and hash-joined locally on the integer key.
No coordinates, no reshuffle; the result stays partitioned by the left
catalog. how="inner" (default) or how="left" both work.
This is a fluent-only feature and has some limits to know:
- Integer keys only — string join keys are not yet supported.
- A frame has no coordinates, so it's a
.join()RHS only;.crossmatch(<frame>)errors (open it withacid.open(frame, ra=…, dec=…)for a spatial match instead). nested=Trueover a broadcast frame is not supported yet.- There is no
db.sql/ CLI surface for broadcast frames — for a named SQL table useregister_file(above), which builds a spatial virtual catalog and needsra/dec.
For a large lookup keyed by a non-spatial id, or one keyed by string, materialize it as a co-partitioned HATS catalog and use a normal equi-join instead.
Composable (bushy) joins¶
A crossmatch / join operand may itself be a full join chain, not
just a single catalog. This stays one per-partition program — no
shuffle — because every sub-expression is HEALPix-partitioned by its
leftmost catalog:
import astropy.units as u
# crossmatch whose RHS is itself an equi-joined pair (b ⋈ c):
a.crossmatch(b.join(c, on="objectId"), radius=1 * u.arcsec)
# the outer equi key may bind ANY leaf of the operand sub-spine — e.g.
# an association table joined to ZTF, then matched into Gaia by gaia_id:
gaia.join(ztf.join(ztf2gaia, on="ztf_id"), on="gaia_id")
Two correctness rules apply:
- Margin budget. A leaf reached through nested spatial matches can
be pulled toward the root by the sum of the match radii on its
path — so a deep leaf in a multi-radius chain needs a margin cache
at least that wide, or
acidrejects the query at compile time. - An operand is a relation, not a result. A pre-filter
where, awith_columns, an innercrossmatch/join, and a bare/renamed.select()shape the operand; an operand.aggregate()/group_by/sort/limitis rejected (materialize via.save()and re-open to reduce an operand first).
The operand must be spatially bounded — its key-sharing rows localized within a finite margin. A scattered, non-spatial id↔id mapping needs the broadcast path (above) or a materialized, co-partitioned catalog.
Window functions¶
Window functions (OVER (PARTITION BY ... ORDER BY ...)) are only
available on the SQL surface, not the fluent one. They are accepted
when they are decomposable along the partition direction — i.e. the
PARTITION BY key is an integer-ID column that divides the work
along the same lines as the HEALPix partitioning. Cross-partition
windows are rejected (the engine has no global sort step in scope).
SELECT id,
mjd,
mag,
mag - AVG(mag) OVER (PARTITION BY object_id) AS dmag
FROM forced_source
WHERE band = 'r'
If your window needs to span the whole table, the supported workaround
is the same as for non-decomposable aggregates: materialize and
post-process in pandas / Polars after .to_polars().
Things acid's SQL won't accept¶
These shapes are rejected at parse or analyze time, by design. The error message names the rejected shape and (where the fix is obvious) hints at the way to write it:
MEDIAN,MODE, percentile functions, and other non-decomposable aggregates — they would silently require global materialization. See aggregation > Why noagg.median?.COUNT(DISTINCT col)andSELECT DISTINCT— same reason.- A bare
GROUP BYwith no aggregates — equivalent toDISTINCT, same rejection. - An unbounded
ORDER BYwith noLIMIT— a full global sort. Add aLIMIT Kto get the top-K push-down, or save and sort locally. RIGHT JOIN,FULL JOIN,CROSS JOIN— swap roles and useLEFT JOINinstead.XMATCHmixed with anotherONpredicate —XMATCHmust be the wholeONclause; move the extra term toWHERE.IN_MOC(...)outside a top-level AND-edWHERE— inSELECTprojection,ORDER BY,HAVING,CASE, a disjunction, or aJOIN ON. See sky regions & footprints.- Subqueries with
JOIN/ aggregates /DISTINCT/ narrow projection in the inner SELECT (see the section above). - CTE-to-CTE references.
Each rejection comes back as a ValidationError (analyzer rule)
or ParseError (SQL shape). See errors for
the full set.
When to drop into SQL vs. stay in the fluent surface¶
A quick guide:
| What you want to do | Prefer fluent? | Prefer SQL? |
|---|---|---|
| Simple crossmatch + projection | ✓ | |
| Filtering by mag / color / flags | ✓ | |
group_by / decomposable aggregates |
✓ | |
Top-K (sort + limit) |
✓ | ✓ |
Nested joins (nested=True) / collect_lists |
✓ (only) | |
Python UDFs (with_columns / @acid.function) |
✓ (only) | |
| Broadcast join against an in-memory frame | ✓ (only) | |
maxmatch=N (up to N nearest) |
✓ (only) | |
| Inline subquery pre-filter / CTE | ✓ | |
| Window functions | ✓ (only) | |
map_partitions SQL table functions |
✓ (only) | |
| Translating a SQL snippet from somewhere else | ✓ | |
| Composing query pieces from Python state | ✓ | |
Multi-step pipelines with intermediate .save() |
✓ |
A future "Verbs or SQL — which should I use?" guide will turn this into a full decision aid; for now, the rule of thumb is fluent first, drop into SQL when the verb doesn't exist.
See also¶
- Crossmatching catalogs — the full reference for
XMATCH(radius, mode, dist_col, the margin-cache rule). - Sky regions & footprints — the full reference
for
IN_MOCand where it's allowed. - Aggregating — the decomposable set, top-K push-down, and why non-decomposable aggregates are rejected.
- Working with results & exporting — what
acid.sql.query(...)returns, in what units, and how to write it out. - SQL features reference — the enumerated dialect table.
- Errors reference — every rejection above, with the verbatim message.