Skip to content

Writing SQL (the escape hatch)

The fluent Catalog verbs cover most of what astronomers ask of acid: where, select, crossmatch, join, in_region, group_by / aggregate (a post-aggregate where is the HAVING role), collect_lists, with_columns, sort / limit, save. Each of those has its own task-shaped page:

This page covers acid.sql.query(...) — the SQL escape hatch — which is what you reach for when:

  • you want one of the SQL-only shapes (a window function, an inline subquery, a CTE pre-filter, a non-decomposable aggregate that you know acid will reject so you can fall back to a different query shape);
  • you already have a SQL string from somewhere else (a notebook cookbook, a colleague's pipeline) and don't want to translate it;
  • you're more comfortable in SQL than in a fluent Python chain.

acid.sql.query(query) always returns a Result; the dialect, the extensions (XMATCH, IN_MOC), and the rejection rules are all documented below.

The SQL surface

acid's SQL is the standard SELECT / FROM / JOIN / WHERE / GROUP BY / HAVING / ORDER BY / LIMIT shape every astronomer has seen, plus two astronomy-specific extensions:

  • XMATCH(radius_arcsec => …) in JOIN ... ON clauses for spherical crossmatching — the full reference for which is in the crossmatch guide. The dist_col => '<name>' named argument surfaces the match distance (arcseconds) as a named column.
  • IN_MOC(<alias>, '<name>') as a per-row predicate restricting rows to a MOC footprint — full reference in sky regions & footprints.

The dialect is parsed by SQLGlot and the accepted subset is what ACID's analyzer accepts. The rest of this page covers the bits you won't find on the fluent surface.

Inline subqueries and CTEs in FROM / JOIN-RHS

ACID accepts inline subqueries (and CTE references) in three positions — the anchor FROM, an XMATCH JOIN's right-hand side, and an ordinary JOIN's right-hand side — restricted to one shape:

(SELECT * FROM <registered_catalog> [WHERE <row-level predicate>]) AS alias

The WHERE clause is folded into ACID's per-partition pre-filter plumbing. For XMATCH with mode => 'nearest' this gives "nearest surviving match" semantics — the matcher only sees passing rows, so the closest is the closest of the survivors:

SELECT a.id, b.id AS b_id
FROM   (SELECT * FROM gaia_dr3 WHERE phot_g_mean_mag < 16) AS a
JOIN   (SELECT * FROM twomass_psc WHERE j_m < 14) AS b
  ON   XMATCH(radius_arcsec => 1.0)

A CTE form is equivalent (and often more readable):

WITH a AS (SELECT * FROM gaia_dr3    WHERE phot_g_mean_mag < 16),
     b AS (SELECT * FROM twomass_psc WHERE j_m < 14)
SELECT a.id, b.id AS b_id
FROM   a JOIN b ON XMATCH(radius_arcsec => 1.0)

Restrictions on the inner SELECT

The inner SELECT shape is restricted to keep the pre-filter cheap and correct. Reject-with-error for:

  • a projection narrower than * (SELECT id, ra, dec FROM gaia — no);
  • JOINs, GROUP BY / aggregates, DISTINCT, ORDER BY, LIMIT, sub-subqueries inside the body;
  • CTE-to-CTE references (the CTE body's FROM must name a registered catalog).

The inner WHERE is also restricted — comparisons, logical operators, IS NULL, BETWEEN, IN (...) of literals, simple arithmetic. No function calls, CAST, LIKE, or subqueries inside the inner WHERE. This is narrower than the outer WHERE allowlist because the inner predicate is lowered through pl.sql_expr for parquet pushdown; the outer one runs after the data is in memory.

If you need a wider inner-SELECT shape — say, a JOIN or a narrowed projection inside the subquery — materialize the intermediate with Catalog.save(...), register it on the connection, and reference it by name in the follow-up query. The save round-trip is the supported path for "a SQL pre-step the dialect doesn't support".

Raw files in SQL queries

acid.open(<file>) returns a fluent Catalog but does not register a name, so acid.sql.query("... FROM <file> ...") can't see it (an ad-hoc file can't shadow a configured catalog). To reference a raw file by name in SQL, register it explicitly with register_file(name, path, ra=…, dec=…)acid.register_file(...) on the default connection, or the same-named method on an explicit Connection:

import acid

acid.init("/data/hats")
acid.register_file("targets", "candidates.csv", ra="RA", dec="DEC")
r = acid.sql.query("""
    SELECT t.id, g.source_id
    FROM   targets t
    JOIN   gaia_dr3 g ON XMATCH(radius_arcsec => 1.0)
""")
tbl = r.to_astropy()

register_file spills the file once to a virtual catalog and puts name in the registry, so both acid.sql.query("... FROM <name> ...") and acid.open(<name>) resolve it. ra/dec are required (never guessed).

If you don't need the named SQL surface, the fluent path is simpler still — acid.open("candidates.csv", ra="RA", dec="DEC") gives you a Catalog you can .crossmatch(...) straight away (no registration).

On the CLI the same thing is acid query --open:

acid query "SELECT t.id, g.source_id FROM t JOIN gaia_dr3 ON XMATCH(radius_arcsec => 1.0)" \
    --db /data/hats --open t=candidates.csv,ra=RA,dec=DEC

See bring your own target list for the fluent counterpart (acid.open(<file>, ra=…, dec=…)), which is the simpler path when you don't need the SQL surface.

Attaching an ID-keyed lookup table

When you have a small, position-less lookup table — an id→label map, a classification, a per-object weight — Catalog.join(<frame>, on=…) joins it in directly. The right operand can be a polars, pandas, NumPy-structured, pyarrow, or Astropy frame (not just another Catalog):

attach_labels.py
import acid
import polars as pl

acid.init("/data/hats", workers=8)

labels = pl.DataFrame({"source_id": [...], "var_class": [...]})

classified = (acid.open("gaia_dr3")
              .where("phot_g_mean_mag < 18")
              .join(labels, on="source_id"))     # attach var_class by id

The frame is broadcast — spilled once to a small Arrow file, read whole into every worker, and hash-joined locally on the integer key. No coordinates, no reshuffle; the result stays partitioned by the left catalog. how="inner" (default) or how="left" both work.

This is a fluent-only feature and has some limits to know:

  • Integer keys only — string join keys are not yet supported.
  • A frame has no coordinates, so it's a .join() RHS only; .crossmatch(<frame>) errors (open it with acid.open(frame, ra=…, dec=…) for a spatial match instead).
  • nested=True over a broadcast frame is not supported yet.
  • There is no db.sql / CLI surface for broadcast frames — for a named SQL table use register_file (above), which builds a spatial virtual catalog and needs ra/dec.

For a large lookup keyed by a non-spatial id, or one keyed by string, materialize it as a co-partitioned HATS catalog and use a normal equi-join instead.

Composable (bushy) joins

A crossmatch / join operand may itself be a full join chain, not just a single catalog. This stays one per-partition program — no shuffle — because every sub-expression is HEALPix-partitioned by its leftmost catalog:

import astropy.units as u

# crossmatch whose RHS is itself an equi-joined pair (b ⋈ c):
a.crossmatch(b.join(c, on="objectId"), radius=1 * u.arcsec)

# the outer equi key may bind ANY leaf of the operand sub-spine — e.g.
# an association table joined to ZTF, then matched into Gaia by gaia_id:
gaia.join(ztf.join(ztf2gaia, on="ztf_id"), on="gaia_id")

Two correctness rules apply:

  • Margin budget. A leaf reached through nested spatial matches can be pulled toward the root by the sum of the match radii on its path — so a deep leaf in a multi-radius chain needs a margin cache at least that wide, or acid rejects the query at compile time.
  • An operand is a relation, not a result. A pre-filter where, a with_columns, an inner crossmatch/join, and a bare/renamed .select() shape the operand; an operand .aggregate() / group_by / sort / limit is rejected (materialize via .save() and re-open to reduce an operand first).

The operand must be spatially bounded — its key-sharing rows localized within a finite margin. A scattered, non-spatial id↔id mapping needs the broadcast path (above) or a materialized, co-partitioned catalog.

Window functions

Window functions (OVER (PARTITION BY ... ORDER BY ...)) are only available on the SQL surface, not the fluent one. They are accepted when they are decomposable along the partition direction — i.e. the PARTITION BY key is an integer-ID column that divides the work along the same lines as the HEALPix partitioning. Cross-partition windows are rejected (the engine has no global sort step in scope).

SELECT id,
       mjd,
       mag,
       mag - AVG(mag) OVER (PARTITION BY object_id) AS dmag
FROM   forced_source
WHERE  band = 'r'

If your window needs to span the whole table, the supported workaround is the same as for non-decomposable aggregates: materialize and post-process in pandas / Polars after .to_polars().

Things acid's SQL won't accept

These shapes are rejected at parse or analyze time, by design. The error message names the rejected shape and (where the fix is obvious) hints at the way to write it:

  • MEDIAN, MODE, percentile functions, and other non-decomposable aggregates — they would silently require global materialization. See aggregation > Why no agg.median?.
  • COUNT(DISTINCT col) and SELECT DISTINCT — same reason.
  • A bare GROUP BY with no aggregates — equivalent to DISTINCT, same rejection.
  • An unbounded ORDER BY with no LIMIT — a full global sort. Add a LIMIT K to get the top-K push-down, or save and sort locally.
  • RIGHT JOIN, FULL JOIN, CROSS JOIN — swap roles and use LEFT JOIN instead.
  • XMATCH mixed with another ON predicateXMATCH must be the whole ON clause; move the extra term to WHERE.
  • IN_MOC(...) outside a top-level AND-ed WHERE — in SELECT projection, ORDER BY, HAVING, CASE, a disjunction, or a JOIN ON. See sky regions & footprints.
  • Subqueries with JOIN / aggregates / DISTINCT / narrow projection in the inner SELECT (see the section above).
  • CTE-to-CTE references.

Each rejection comes back as a ValidationError (analyzer rule) or ParseError (SQL shape). See errors for the full set.

When to drop into SQL vs. stay in the fluent surface

A quick guide:

What you want to do Prefer fluent? Prefer SQL?
Simple crossmatch + projection
Filtering by mag / color / flags
group_by / decomposable aggregates
Top-K (sort + limit)
Nested joins (nested=True) / collect_lists ✓ (only)
Python UDFs (with_columns / @acid.function) ✓ (only)
Broadcast join against an in-memory frame ✓ (only)
maxmatch=N (up to N nearest) ✓ (only)
Inline subquery pre-filter / CTE
Window functions ✓ (only)
map_partitions SQL table functions ✓ (only)
Translating a SQL snippet from somewhere else
Composing query pieces from Python state
Multi-step pipelines with intermediate .save()

A future "Verbs or SQL — which should I use?" guide will turn this into a full decision aid; for now, the rule of thumb is fluent first, drop into SQL when the verb doesn't exist.

See also