Skip to content

SQL features

This page is the canonical "what does ACID's SQL surface accept, and what does it reject" reference. It is what acid.sql.query(...) will run without complaint and the analyzer-enforced rules it applies. Use it when something raises and you want to know why; use the SQL escape hatch guide for worked examples.

ACID's SQL is parsed by SQLGlot and the accepted subset is what ACID's analyzer accepts. Two extensions are unique to ACID:

  • XMATCH(radius_arcsec => ...) in a JOIN ... ON clause for spherical crossmatching.
  • IN_MOC(<alias>, '<name>') as a per-row predicate restricting a query to a MOC footprint.

Both have rules about where they may appear; both are documented below.

At a glance

Feature Supported? Notes
SELECT columns / expressions yes Including aliased expressions (floor(mag) AS bin).
WHERE predicates (conjunctive, disjunctive) yes Per-partition pushdown for column / cone / MOC predicates.
GROUP BY (with a supported aggregate) yes See Aggregates below.
Bare GROUP BY (no aggregate) no ValidationError. Use an aggregate or de-duplicate the result.
HAVING yes Over aggregate output / group keys.
ORDER BY ... LIMIT K (top-K) yes LIMIT is pushed to each partition.
ORDER BY ... (no LIMIT) no ValidationError. Add a LIMIT, or sort the materialized result.
LIMIT K, OFFSET yes
SELECT DISTINCT no ValidationError. .to_polars().unique() / GROUP BY + aggregate.
COUNT(DISTINCT col) (and other DISTINCT aggregates) no ValidationError. .to_polars().n_unique().
INNER JOIN ... ON a.x = b.x (integer-ID equi-join) yes Both keys integer ID columns. The right catalog must be localized (a _healpix_29 column + a margin cache); see the SQL escape hatch.
LEFT JOIN ... ON a.x = b.x yes
LEFT JOIN ... ON XMATCH(...) yes
RIGHT JOIN, FULL JOIN, CROSS JOIN no ValidationError. Swap roles and use LEFT JOIN.
XMATCH(...) mixed with other predicates in ON no ValidationError (XMATCH must be the entire ON predicate).
IN_MOC(<alias>, '<name>') in conjunctive WHERE yes Optionally NOT IN_MOC(...). Fast partition / row-group pushdown.
IN_MOC(...) in SELECT / ORDER BY / HAVING / CASE no ValidationError. MOC membership is a restriction, not a per-row scalar.
IN_MOC(...) inside OR (any disjunction) no ValidationError. Restructure into conjunctive WHERE.
IN_MOC(...) in JOIN ... ON no ValidationError. Move to a top-level WHERE.
Inline subquery in FROM / XMATCH JOIN RHS / ordinary JOIN RHS yes One shape only: (SELECT * FROM <catalog> [WHERE <row-level predicate>]) AS alias.
CTE (WITH ... AS (...)) in any of those positions yes Same shape restriction as the inline subquery form.
CTE-to-CTE references / WITH RECURSIVE no ParseError. The CTE body's FROM must name a registered catalog.
Decomposable aggregates (COUNT, SUM, AVG, MIN, MAX, STDDEV, VARIANCE, BOOL_AND, BOOL_OR) yes Phase-1 partials, partial aggregation.
Non-decomposable aggregates (MEDIAN, MODE, percentiles, STRING_AGG) no ValidationError. Aggregate down decomposably, finish in Polars / pandas.
Window functions (ROW_NUMBER() OVER (...)) no ValidationError (Window functions are not supported across partitioned catalogs.).
CASE WHEN ... THEN ... ELSE ... END yes In SELECT, WHERE, HAVING. Not for IN_MOC (see above).
Arithmetic / comparison / boolean operators yes
IS NULL / IS NOT NULL, BETWEEN, IN (...) yes

The rest of this page expands each row that has rules attached.

The XMATCH extension

JOIN <right> ON XMATCH(radius_arcsec => 1.0)
JOIN <right> ON XMATCH(r => 1.0)
JOIN <right> ON XMATCH(r => 1.0, mode => 'all')
JOIN <right> ON XMATCH(r => 1.0, dist_col => 'sep')
LEFT JOIN <right> ON XMATCH(r => 1.0)

Allowed keyword arguments

Keyword Required Type Notes
radius_arcsec yes float The match radius, in arcseconds. Alias: r. Must be positive.
mode no string 'nearest' (default) or 'all'. Lower-case; case-insensitive.
dist_col no string Output column for the great-circle separation, in arcseconds.

Anything else is rejected:

ValidationError: unknown XMATCH keyword argument(s): ['k']
hint: supported keywords are radius_arcsec (alias r), mode, and dist_col.

Positional arguments and non-literal values are rejected at parse time (only literal values are accepted in XMATCH() kwargs).

Allowed positions

XMATCH must be the entire ON predicate of a JOIN (or LEFT JOIN). The analyzer rejects every other position.

Position Verdict
JOIN b ON XMATCH(...) yes
LEFT JOIN b ON XMATCH(...) yes
JOIN b ON XMATCH(...) AND a.mag < 20 no — XMATCH must be the entire ON predicate; compound predicates like 'XMATCH(...) AND ...' are not supported. Move the extra term to WHERE.
RIGHT JOIN, FULL JOIN, CROSS JOIN ... ON XMATCH(...) no — unsupported XMATCH join type 'RIGHT'; acid supports INNER (JOIN) and LEFT JOIN only. Swap sides.
XMATCH(...) in SELECT / WHERE / etc. no — XMATCH is a join predicate, not a scalar function.

Right-side requirements

For every XMATCH, the analyzer checks the right catalog's margin cache at analyze time, before any data is read:

ValidationError: XMATCH right table 'twomass_psc' has no neighbor_path (margin cache) configured
hint: build one with `acid hats build-margin <catalog>`.
ValidationError: XMATCH radius_arcsec=5.0 exceeds 'twomass_psc'`s neighbor_margin_arcsec=1.0; matches near partition boundaries would be silently missed
hint: rebuild the margin cache at a larger radius, or shrink XMATCH radius.

See the crossmatch guide for the why and the fix.

The right catalog must also expose ra_col / dec_col in the registry (XMATCH right table 'x' has no ra_col/dec_col in the registry), as must the anchor (anchor table 'a' has no ra_col/dec_col in the registry).

The IN_MOC extension

WHERE IN_MOC(<table_alias>, '<moc_name>')
WHERE NOT IN_MOC(<table_alias>, '<moc_name>')

The first argument is a table alias in the query (the side the predicate is restricting). The second is a string literal naming either a registered MOC or a registered catalog (in the latter case ACID auto-loads the catalog's point_map.fits footprint).

Allowed positions

IN_MOC(...) is a footprint restriction, not a per-row scalar. The analyzer allows it in exactly one place:

  • A top-level AND-ed term in a WHERE clause, optionally negated with NOT (any even number of NOTs is fine — NOT NOT IN_MOC(...) collapses to the positive predicate).

Every other position is rejected:

Position Verdict
WHERE IN_MOC(a, 'des') yes
WHERE IN_MOC(a, 'des') AND mag < 18 yes
WHERE NOT IN_MOC(a, 'mask') yes
WHERE IN_MOC(a, 'des') OR IN_MOC(a, 'delve') no — IN_MOC() inside WHERE must sit in a conjunctive position (top-level AND-chain, optionally negated with NOT). Found inside Or; move the predicate to the top-level WHERE or to a SELECT projection (which evaluates per-row). Move the disjunction inside a single combined MOC, or split into two queries and union the results downstream.
SELECT IN_MOC(a, 'des') AS in_field no — IN_MOC() is only supported as a footprint restriction in WHERE (a top-level AND-ed term, optionally negated with NOT) or via the fluent .in_region(...) verb. It is not allowed in SELECT projections, ORDER BY, HAVING, or CASE.
ORDER BY IN_MOC(a, 'des'), HAVING IN_MOC(...) no — same message as above.
CASE WHEN IN_MOC(a, 'des') THEN ... END no — same message as above.
JOIN b ON IN_MOC(a, 'des') no — IN_MOC() is not supported in JOIN ON predicates; move it to WHERE or a SELECT projection.
IN_MOC(...) inside an inline subquery's WHERE no — IN_MOC(...) is not allowed inside an inline subquery's WHERE. Use a top-level WHERE; that path has the fast partition / row-group pruning for MOC predicates.

The fluent equivalent is Catalog.in_region(...); see Sky regions & footprints.

MOC name resolution

The second argument is resolved in this order:

  1. A MOC registered explicitly via Connection.register_moc(name, source) or a mocs: section in the YAML config.
  2. A catalog name registered on the connection. The analyzer lazily loads the catalog's point_map.fits footprint under that name; subsequent uses are cached.

Unknown names raise IN_MOC() references unknown MOC '<name>' with a "did you mean…?" suggestion drawn from the registered MOC and catalog names.

Unknown aliases raise IN_MOC() references unknown table alias '<alias>' (with a "did you mean…?" suggestion from the known aliases).

Inline subqueries and CTE pre-filters

Three positions accept a restricted subquery / CTE form:

  • the anchor FROM,
  • an XMATCH JOIN's right-hand side,
  • an ordinary JOIN's right-hand side.

The accepted inner shape is exactly:

(SELECT * FROM <registered_catalog> [WHERE <row-level predicate>]) AS alias

The WHERE clause is folded into ACID's per-partition pre-filter plumbing. For XMATCH with mode => 'nearest' this gives "nearest surviving match" semantics — the matcher only sees passing rows.

The full design rationale (the three-layer validation, why the predicate grammar is restricted to a SQL ∩ Polars intersection) is in the archive at docs/archive/SUBQUERY-RHS.md.

Restrictions on the inner SELECT

The inner SELECT must be SELECT *. The analyzer rejects (as ParseError) every other shape:

Inner clause / construct Verdict
SELECT id, ra, dec FROM ... no — inline subquery must be 'SELECT *'. Narrower projections are not supported in v1.
SELECT * EXCEPT (col) FROM ... no — inline subquery must be plain 'SELECT *' — EXCLUDE / EXCEPT / REPLACE / RENAME are not supported.
JOIN, GROUP BY, HAVING no — inline subquery cannot contain JOIN (and same for the others).
ORDER BY, LIMIT, OFFSET no — inline subquery cannot contain ORDER BY (and same for the others).
DISTINCT, QUALIFY, WINDOW, WITH no — same shape of error.
Nested subqueries / CTEs in body no — inline subquery's FROM must be a single physical table.
WITH RECURSIVE no — WITH RECURSIVE is not supported; CTEs in acid must reference a registered catalog directly.
CTE body's FROM is another CTE no — CTE 'a' body's FROM ('b') is itself a CTE; CTE-to-CTE references are not supported.

Restrictions on the inner WHERE

The inner WHERE is narrower than the outer one because it is lowered through Polars' pl.sql_expr for parquet pushdown, which means it has to stay in the subset Polars can push down to parquet row-group filters. The allowlist:

Construct Allowed
Column references (mag, flags, ra), literals (numeric / string / bool / null) yes
Comparisons: =, <>, <, <=, >, >= yes
Logical: AND, OR, NOT, parentheses yes
Null-test: IS NULL, IS NOT NULL yes
BETWEEN ... AND ... yes
IN (<literal>, <literal>, ...) yes
Arithmetic: +, -, *, /, %, unary - yes
Function calls (floor(...), sqrt(...), anything else) no — function calls (<name>) are not allowed inside an inline subquery's WHERE.
CAST(x AS type) / x::type no — CAST / ::type is not allowed inside an inline subquery's WHERE.
LIKE, ILIKE, SIMILAR TO no — string-matching dialects diverge between engines.
IN (<subquery>) no — IN (<subquery>) is not allowed inside an inline subquery's WHERE; use IN (<literal>, ...) instead.
Subqueries no
IN_MOC(...) (it has its own fast path at the outer WHERE) no
Window functions / aggregates no
Correlated references to outer aliases no — column qualifier 'x' does not refer to this subquery's inner scope.

The outer (top-level) WHERE has no such allowlist — it runs after the data is in memory and accepts the full SQLGlot-parsed expression set.

If you need a wider inner shape — a narrower projection, a JOIN, an aggregate — materialize the intermediate with Catalog.save(...), register it on the connection, and reference it as a plain table in the outer query.

Aggregates

The full task-shaped story is in the aggregation guide. The reference summary:

Decomposable (supported)

COUNT, SUM, MIN, MAX, AVG, STDDEV / STDDEV_POP, VARIANCE / VARIANCE_POP, BOOL_AND, BOOL_OR, plus BIT_AND / BIT_OR / BIT_XOR as their bitwise-aggregate aliases.

These run as partial aggregation: phase 1 computes per-partition partials (one row per group per partition), phase 2 combines them. The cost scales with the number of groups, not the number of rows.

Non-decomposable (rejected)

ValidationError: MEDIAN() cannot be executed across partitions. Supported aggregates: COUNT, SUM, MIN, MAX, AVG, STDDEV, STDDEV_POP, VARIANCE, VARIANCE_POP, BOOL_AND, BOOL_OR.

The rejected family is: MEDIAN, MODE, PERCENTILE_CONT, PERCENTILE_DISC, QUANTILE, STRING_AGG, GROUP_CONCAT. The escape hatch is to aggregate down with the decomposable verbs and finish in Polars or pandas (r.to_polars().group_by(...).agg(pl.col("mag").median())).

Deprecated global-reduce shapes (rejected)

These four shapes need every row of every partition at once and have no decomposable form. They are rejected with a rewrite hint, not a fallback:

Rejected shape Hint in the error
SELECT DISTINCT ... Project the columns and de-duplicate the materialized result (e.g. db.sql(...).to_polars().unique()), or GROUP BY with an aggregate.
COUNT(DISTINCT ...) (and other DISTINCT aggregates) Compute the distinct count on the materialized result (e.g. db.sql(...).to_polars().n_unique()).
ORDER BY ... (no LIMIT) Add a LIMIT for top-K, or sort the materialized result.
GROUP BY ... (no aggregate) Add an aggregate (COUNT/SUM/AVG/MIN/MAX/...), or de-duplicate the materialized result.

Window functions

Rejected at analyze time. The architecture is partition-local, so a window like ROW_NUMBER() OVER (PARTITION BY object_id ORDER BY mjd) that needs to see rows across partitions has no correct execution.

ValidationError: Window functions are not supported across partitioned catalogs.

For "per-object" patterns that fit a single partition's rows (e.g. one row per object, computed from many forced-source rows), the canonical replacement is GROUP BY object_id with a decomposable aggregate.

The GROUP BY SELECT-alias gotcha

The analyzer resolves GROUP BY terms to physical column names, and a SELECT alias of a computed expression is not a column. This breaks:

SELECT CAST(mag AS BIGINT) AS bin, COUNT(*) AS n
FROM   obs
GROUP BY bin                                  -- alias of a computed expr

The error is a generic column-resolution failure — ValidationError: unknown column 'bin' — the analyzer doesn't recognize the alias as referring to the SELECT expression. (If your traceback shows this message and you're grouping by an aliased computed expression, this section is the cause.) The fix on the SQL side is to group by the expression itself:

SELECT CAST(mag AS BIGINT) AS bin, COUNT(*) AS n
FROM   obs
GROUP BY CAST(mag AS BIGINT)

The fluent surface has no such limit — group_by("CAST(mag AS BIGINT) AS bin") recognizes the alias, groups by the underlying expression, and names the output column bin for you. If you find yourself writing the same expression on both sides of a GROUP BY, prefer the fluent surface (see the aggregation guide).

Ordinary joins

JOIN <right> ON <a>.<key> = <b>.<key>
LEFT JOIN <right> ON <a>.<key> = <b>.<key>

The right table must be a registered catalog (an unresolved relation has no on-disk parquet to scan — there is no attached-DB / table-function backend, so the analyzer rejects it explicitly):

ValidationError: ordinary JOIN right-hand table 'tmp' is not a registered catalog

When the right side is an inline subquery whose inner FROM is unregistered:

ValidationError: inline subquery on JOIN right side requires the inner FROM table ('tmp') to be a registered catalog
hint: register the catalog with acid.Registry, or drop the subquery and use a plain table reference.

Both keys in the ON predicate must be integer-ID columns (see the SQL escape hatch). Use acid.sql.query(...) for arbitrary join keys; the fluent Catalog.join only accepts integer-ID columns.

Frame, units, and conventions

The dialect rules above are about shape. The data conventions are about meaning:

  • All RA/Dec are J2000 (ICRS). No epoch propagation or proper motion is applied. See the crossmatch guide's epoch section.
  • XMATCH(radius_arcsec => ...) is in arcseconds. The r => ... alias is the same unit.
  • dist_col => '<name>' produces a float64 column in arcseconds. Nullable on unmatched LEFT-join rows.
  • acid.in_cone(center, radius=...) cones are in degrees, ICRS. The CLI's --cone RA,DEC,RADIUS_DEG flag is the same convention.

See also