Writing queries

You write ACID queries in SQL. The dialect is the standard SELECT / WHERE / GROUP BY / ORDER BY shape every astronomer has seen at some point, plus three astronomy-specific extensions:

  • XMATCH(radius_arcsec => …) in JOIN ... ON clauses for spherical crossmatching.
  • XMATCH_DISTANCE(<alias>) as a SELECT-level function returning the match distance in arcseconds.
  • IN_MOC(<alias>, '<name>') as a per-row predicate restricting rows to a MOC footprint.

This page is a tour. Each section ends with a one-line "things that don't work" note so you find dead ends fast.

SELECT, WHERE, ORDER BY, LIMIT

These all work the way you expect them to.

SELECT designation, ra, dec, j_m
FROM   two_mass
WHERE  j_m < 14.0
ORDER BY j_m
LIMIT 100

acid only reads the columns you asked for from disk, even if the catalog has 60 more. Filter early and pick narrow column lists — both push down to parquet and make queries dramatically faster on large catalogs.

Doesn't work

CTEs (WITH …) and subqueries in the FROM clause are not supported in the anchor position. Use Session.materialize(...) to cache an intermediate result and reference it as a real catalog in a follow-up query.

Crossmatching: the XMATCH operator

The one new thing is what goes in ON:

SELECT a.source_id, b.designation
FROM   gaia     AS a
JOIN   two_mass AS b ON XMATCH(radius_arcsec => 1.0)

That means "match Gaia to 2MASS by sky position, with a 1-arcsec radius." acid reads the RA / Dec columns from each catalog's properties file — you do not name them in the query.

XMATCH takes two named arguments:

Argument Required? Meaning
radius_arcsec (alias: r) yes Match radius in arcseconds.
mode no 'nearest' (default) or 'all'.

'nearest' returns the single closest match within the radius (if any). 'all' returns every match within the radius. Use 'all' in crowded fields, or when you'll post-process to pick a different "best" criterion than nearness.

-- Every 2MASS source within 5″ of each Gaia source:
SELECT a.source_id, b.designation
FROM   gaia     AS a
JOIN   two_mass AS b ON XMATCH(radius_arcsec => 5.0, mode => 'all')

Behind the scenes, the radius you request must be at most the right-side catalog's margin-cache width — see Concepts for the reason. If you ask for a wider radius the query is rejected with a clear error rather than silently dropping boundary matches.

Doesn't work

XMATCH must be the entire ON predicate. Compound predicates like ON XMATCH(...) AND a.mag < 20 are rejected. Put non-XMATCH conditions in WHERE instead.

Match distance: XMATCH_DISTANCE(<alias>)

For every matched pair, XMATCH_DISTANCE(b) returns the great-circle distance in arcseconds. It works in SELECT, WHERE, and ORDER BY:

SELECT a.source_id,
       b.designation,
       XMATCH_DISTANCE(b) AS d_arcsec
FROM   gaia     AS a
JOIN   two_mass AS b ON XMATCH(radius_arcsec => 1.0)
WHERE  XMATCH_DISTANCE(b) < 0.5        -- tighter than the search radius
ORDER BY d_arcsec ASC
LIMIT  100

A common idiom is to widen the XMATCH radius (to catch every candidate) and then tighten with WHERE XMATCH_DISTANCE(...) < … when you'll re-rank or post-filter the matches.

LEFT JOIN: keeping unmatched anchors

A LEFT JOIN with XMATCH keeps every row of the left (anchor) catalog and adds NULLs where there's no match:

SELECT a.source_id, b.designation
FROM   gaia         AS a
LEFT JOIN two_mass  AS b ON XMATCH(radius_arcsec => 1.0)

This is how you ask "every Gaia source, with its 2MASS counterpart if any." The b.* columns are NULL for unmatched anchors.

Doesn't work

RIGHT JOIN, FULL JOIN, and CROSS JOIN are not supported. If you want "every 2MASS source with its Gaia counterpart", swap the roles: put 2MASS in FROM and LEFT JOIN gaia.

Filtering by footprint: IN_MOC

IN_MOC(<alias>, '<name>') returns TRUE when the row's sky position falls inside a named MOC (a sky region — see Concepts). Use it to restrict a query to a survey footprint, exclude a mask, or take the intersection of two regions:

SELECT a.source_id, a.ra, a.dec
FROM   gaia AS a
WHERE  IN_MOC(a, 'des_dr2_footprint')
  AND NOT IN_MOC(a, 'known_artifacts')

The name lookup happens in two passes: a MOC explicitly registered via s.register_moc() or a YAML mocs: entry always wins. If the name isn't registered but matches a catalog with a point_map.fits footprint, acid lazy-loads that as a MOC named after the catalog (useful for cross-survey footprint intersection — see MOC footprints).

IN_MOC also works in SELECT, ORDER BY, and CASE expressions as a per-row boolean:

SELECT a.source_id,
       IN_MOC(a, 'des_dr2_footprint') AS in_des
FROM   gaia AS a

Doesn't work

Two IN_MOC placements are rejected with ValidationError:

  • IN_MOC(...) inside a disjunction (IN_MOC(...) OR …).
  • IN_MOC(...) in a JOIN ON clause.

The reason is performance: those positions don't admit the partition-level pruning that makes IN_MOC cheap. Restructure into a top-level conjunctive WHERE and the query will fly.

Aggregates: GROUP BY, HAVING, COUNT, SUM, AVG, MIN, MAX

Standard SQL aggregates work, and they are pushed down efficiently — each partition computes a partial aggregate, and a phase-2 step combines them. You don't materialise every row to disk just to count.

SELECT a.source_id,
       COUNT(*)                AS n_2mass_within_5arcsec,
       AVG(XMATCH_DISTANCE(b)) AS avg_d
FROM   gaia     AS a
JOIN   two_mass AS b ON XMATCH(radius_arcsec => 5.0, mode => 'all')
GROUP BY a.source_id
HAVING COUNT(*) >= 2
ORDER BY avg_d ASC
LIMIT  100

Supported aggregates: COUNT, SUM, AVG, MIN, MAX, COUNT(DISTINCT …), SUM(DISTINCT …). SELECT DISTINCT is also supported.

Doesn't work

  • Non-decomposable aggregatesMEDIAN, MODE, STRING_AGG, percentile functions, and similar. These cannot be combined from partial results without materialising everything; acid rejects them at parse time with a clear ValidationError. Compute them in pandas or Polars on r.df() / r.to_polars() after the query.
  • Window functions (OVER (...)). Rejected for the same reason.

COUNT(DISTINCT …) and SELECT DISTINCT fall back to a slower "materialise then deduplicate" path. They are correct but won't fly on billion-row inputs — consider whether you actually need distinctness or just a GROUP BY.

Ordinary joins mixed with XMATCH

You can chain ordinary equi-joins after an XMATCH — for example to attach a lightcurve table by object ID:

SELECT a.source_id,
       b.designation,
       lc.time,
       lc.mag
FROM   gaia        AS a
JOIN   two_mass    AS b  ON XMATCH(radius_arcsec => 1.0)
JOIN   lightcurves AS lc ON a.source_id = lc.gaia_source_id

Every XMATCH in a query uses the anchor (first FROM table) catalog's coordinates, even after a mode => 'all' expansion. That is why the anchor's choice matters: it determines which partitions drive the work.

Top-K queries: ORDER BY … LIMIT

ORDER BY … LIMIT K is pushed down to each partition: each partition keeps its own top-K, and a phase-2 step picks the global top-K from the union. The result is much faster than sorting the full join.

SELECT a.source_id, b.designation, XMATCH_DISTANCE(b) AS d
FROM   gaia     AS a
JOIN   two_mass AS b ON XMATCH(radius_arcsec => 1.0)
ORDER BY d ASC
LIMIT 100

This is the right shape whenever you want "the K closest matches" or "the K brightest sources within a footprint" — both fly even on billion-row inputs.

Summary: dialect at a glance

Feature Supported?
SELECT, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT, OFFSET
INNER JOIN (ordinary, a.x = b.x)
LEFT JOIN (ordinary or XMATCH)
RIGHT JOIN, FULL JOIN, CROSS JOIN
JOIN … ON XMATCH(...) (whole-ON only)
XMATCH_DISTANCE(<alias>) in SELECT, WHERE, ORDER BY
IN_MOC(<alias>, '<name>') in conjunctive WHERE
IN_MOC(...) in SELECT / ORDER BY / CASE
IN_MOC(...) inside OR or in JOIN ON
COUNT, SUM, AVG, MIN, MAX
COUNT(DISTINCT …), SELECT DISTINCT ✅ (slower)
MEDIAN, MODE, percentiles, window functions
CTEs (WITH …), subqueries in FROM

For the full enumerated reference see SQL features. For every error you can hit along the way see Errors.