Writing queries¶
You write ACID queries in SQL. The dialect is the standard
SELECT / WHERE / GROUP BY / ORDER BY shape every astronomer has seen
at some point, plus three astronomy-specific extensions:
XMATCH(radius_arcsec => …)inJOIN ... ONclauses for spherical crossmatching.XMATCH_DISTANCE(<alias>)as a SELECT-level function returning the match distance in arcseconds.IN_MOC(<alias>, '<name>')as a per-row predicate restricting rows to a MOC footprint.
This page is a tour. Each section ends with a one-line "things that don't work" note so you find dead ends fast.
SELECT, WHERE, ORDER BY, LIMIT¶
These all work the way you expect them to.
acid only reads the columns you asked for from disk, even if the
catalog has 60 more. Filter early and pick narrow column lists —
both push down to parquet and make queries dramatically faster on
large catalogs.
Doesn't work
CTEs (WITH …) and subqueries in the FROM clause are not
supported in the anchor position. Use
Session.materialize(...) to cache an
intermediate result and reference it as a real catalog in a
follow-up query.
Crossmatching: the XMATCH operator¶
The one new thing is what goes in ON:
That means "match Gaia to 2MASS by sky position, with a 1-arcsec
radius." acid reads the RA / Dec columns from each catalog's
properties file — you do not name them in the query.
XMATCH takes two named arguments:
| Argument | Required? | Meaning |
|---|---|---|
radius_arcsec (alias: r) |
yes | Match radius in arcseconds. |
mode |
no | 'nearest' (default) or 'all'. |
'nearest' returns the single closest match within the radius (if
any). 'all' returns every match within the radius. Use 'all' in
crowded fields, or when you'll post-process to pick a different
"best" criterion than nearness.
-- Every 2MASS source within 5″ of each Gaia source:
SELECT a.source_id, b.designation
FROM gaia AS a
JOIN two_mass AS b ON XMATCH(radius_arcsec => 5.0, mode => 'all')
Behind the scenes, the radius you request must be at most the right-side catalog's margin-cache width — see Concepts for the reason. If you ask for a wider radius the query is rejected with a clear error rather than silently dropping boundary matches.
Doesn't work
XMATCH must be the entire ON predicate. Compound
predicates like ON XMATCH(...) AND a.mag < 20 are rejected.
Put non-XMATCH conditions in WHERE instead.
Match distance: XMATCH_DISTANCE(<alias>)¶
For every matched pair, XMATCH_DISTANCE(b) returns the great-circle
distance in arcseconds. It works in SELECT, WHERE, and
ORDER BY:
SELECT a.source_id,
b.designation,
XMATCH_DISTANCE(b) AS d_arcsec
FROM gaia AS a
JOIN two_mass AS b ON XMATCH(radius_arcsec => 1.0)
WHERE XMATCH_DISTANCE(b) < 0.5 -- tighter than the search radius
ORDER BY d_arcsec ASC
LIMIT 100
A common idiom is to widen the XMATCH radius (to catch every
candidate) and then tighten with WHERE XMATCH_DISTANCE(...) < …
when you'll re-rank or post-filter the matches.
LEFT JOIN: keeping unmatched anchors¶
A LEFT JOIN with XMATCH keeps every row of the left (anchor)
catalog and adds NULLs where there's no match:
SELECT a.source_id, b.designation
FROM gaia AS a
LEFT JOIN two_mass AS b ON XMATCH(radius_arcsec => 1.0)
This is how you ask "every Gaia source, with its 2MASS counterpart
if any." The b.* columns are NULL for unmatched anchors.
Doesn't work
RIGHT JOIN, FULL JOIN, and CROSS JOIN are not supported.
If you want "every 2MASS source with its Gaia counterpart",
swap the roles: put 2MASS in FROM and LEFT JOIN gaia.
Filtering by footprint: IN_MOC¶
IN_MOC(<alias>, '<name>') returns TRUE when the row's sky
position falls inside a named MOC (a sky region — see
Concepts). Use it to restrict a query
to a survey footprint, exclude a mask, or take the intersection of
two regions:
SELECT a.source_id, a.ra, a.dec
FROM gaia AS a
WHERE IN_MOC(a, 'des_dr2_footprint')
AND NOT IN_MOC(a, 'known_artifacts')
The name lookup happens in two passes: a MOC explicitly registered
via s.register_moc() or a YAML mocs: entry always wins. If the
name isn't registered but matches a catalog with a point_map.fits
footprint, acid lazy-loads that as a MOC named after the catalog
(useful for cross-survey footprint intersection — see
MOC footprints).
IN_MOC also works in SELECT, ORDER BY, and CASE expressions
as a per-row boolean:
Doesn't work
Two IN_MOC placements are rejected with ValidationError:
IN_MOC(...)inside a disjunction (IN_MOC(...) OR …).IN_MOC(...)in aJOIN ONclause.
The reason is performance: those positions don't admit the
partition-level pruning that makes IN_MOC cheap. Restructure
into a top-level conjunctive WHERE and the query will fly.
Aggregates: GROUP BY, HAVING, COUNT, SUM, AVG, MIN, MAX¶
Standard SQL aggregates work, and they are pushed down efficiently — each partition computes a partial aggregate, and a phase-2 step combines them. You don't materialise every row to disk just to count.
SELECT a.source_id,
COUNT(*) AS n_2mass_within_5arcsec,
AVG(XMATCH_DISTANCE(b)) AS avg_d
FROM gaia AS a
JOIN two_mass AS b ON XMATCH(radius_arcsec => 5.0, mode => 'all')
GROUP BY a.source_id
HAVING COUNT(*) >= 2
ORDER BY avg_d ASC
LIMIT 100
Supported aggregates: COUNT, SUM, AVG, MIN, MAX,
COUNT(DISTINCT …), SUM(DISTINCT …). SELECT DISTINCT is also
supported.
Doesn't work
- Non-decomposable aggregates —
MEDIAN,MODE,STRING_AGG, percentile functions, and similar. These cannot be combined from partial results without materialising everything;acidrejects them at parse time with a clearValidationError. Compute them in pandas or Polars onr.df()/r.to_polars()after the query. - Window functions (
OVER (...)). Rejected for the same reason.
COUNT(DISTINCT …) and SELECT DISTINCT fall back to a slower
"materialise then deduplicate" path. They are correct but won't
fly on billion-row inputs — consider whether you actually need
distinctness or just a GROUP BY.
Ordinary joins mixed with XMATCH¶
You can chain ordinary equi-joins after an XMATCH — for example
to attach a lightcurve table by object ID:
SELECT a.source_id,
b.designation,
lc.time,
lc.mag
FROM gaia AS a
JOIN two_mass AS b ON XMATCH(radius_arcsec => 1.0)
JOIN lightcurves AS lc ON a.source_id = lc.gaia_source_id
Every XMATCH in a query uses the anchor (first FROM table)
catalog's coordinates, even after a mode => 'all' expansion. That
is why the anchor's choice matters: it determines which partitions
drive the work.
Top-K queries: ORDER BY … LIMIT¶
ORDER BY … LIMIT K is pushed down to each partition: each
partition keeps its own top-K, and a phase-2 step picks the global
top-K from the union. The result is much faster than sorting the
full join.
SELECT a.source_id, b.designation, XMATCH_DISTANCE(b) AS d
FROM gaia AS a
JOIN two_mass AS b ON XMATCH(radius_arcsec => 1.0)
ORDER BY d ASC
LIMIT 100
This is the right shape whenever you want "the K closest matches" or "the K brightest sources within a footprint" — both fly even on billion-row inputs.
Summary: dialect at a glance¶
| Feature | Supported? |
|---|---|
SELECT, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT, OFFSET |
✅ |
INNER JOIN (ordinary, a.x = b.x) |
✅ |
LEFT JOIN (ordinary or XMATCH) |
✅ |
RIGHT JOIN, FULL JOIN, CROSS JOIN |
❌ |
JOIN … ON XMATCH(...) (whole-ON only) |
✅ |
XMATCH_DISTANCE(<alias>) in SELECT, WHERE, ORDER BY |
✅ |
IN_MOC(<alias>, '<name>') in conjunctive WHERE |
✅ |
IN_MOC(...) in SELECT / ORDER BY / CASE |
✅ |
IN_MOC(...) inside OR or in JOIN ON |
❌ |
COUNT, SUM, AVG, MIN, MAX |
✅ |
COUNT(DISTINCT …), SELECT DISTINCT |
✅ (slower) |
MEDIAN, MODE, percentiles, window functions |
❌ |
CTEs (WITH …), subqueries in FROM |
❌ |
For the full enumerated reference see SQL features. For every error you can hit along the way see Errors.