Crossmatching catalogs¶
You have two sky-survey catalogs and want to associate sources by sky
position — every Gaia source within 1″ of a 2MASS source, every Rubin
object inside 5″ of a known artifact, that kind of thing. This is the
single most common thing astronomers reach for acid to do, and it is
the first page of this guide for a reason.
This page covers:
- How to write a crossmatch on either of the two surfaces — fluent
(
Catalog.crossmatch(...)) and SQL (XMATCH(...)inON). - The four trust-builders the matcher does — and does not — handle: the J2000-only epoch convention, the radius-vs-margin-cache rule (and how to fix it), RA-wrap / poles, output units.
- How to crossmatch a target list you bring yourself — a CSV, parquet, FITS, or an in-memory pandas/Astropy table — without an offline HATS import.
If you already know all of that and just want a recipe, jump to the examples at the bottom.
The shortest crossmatch¶
The two surfaces produce the same query; use whichever suits the rest
of your code. Pick the test fixtures a and b (b is
a shifted by 0.3″ in RA — the synthetic catalogs that come with the
test suite); for real data, substitute gaia_dr3 and twomass_psc or
whatever you have registered.
In both cases the result has one row per matched pair, with every column
from a and every column from b (right-side columns that clash with
the left get a _<alias> suffix — id and id_b, ra and ra_b).
The extra d_arcsec column carries the great-circle separation in
arcseconds (see Output units).
Fluent vs. SQL — pick one
The fluent surface works with flat column names like id and
id_b (the merged columns you see in the output), not a.id /
b.id. The SQL surface still understands alias.col. Most
queries on this page are shown in both; pick whichever shape your
downstream code already uses. See the (future) "Verbs or SQL —
which should I use?" guide for a decision aid.
crossmatch(...) — the verb in detail¶
Catalog.crossmatch takes one positional argument — the right-side
catalog — and keyword arguments:
left.crossmatch(
other, # a Catalog from acid.open(...)
*,
radius, # an astropy Quantity (1 * u.arcsec); bare float rejected
how="inner", # "inner" | "left" — the join type
maxmatch=1, # 1 | -1 (all) | N>=2 — the match multiplicity
dist_col=None, # "d_arcsec" surfaces the separation column
suffix=None, # override the "_<alias>" collision suffix
nested=False, # fold matches into per-row lists (see below)
order_by=None, # element order inside the nested lists
)
radius must be an astropy.units.Quantity with an angular unit
(1 * u.arcsec, 0.5 * u.deg). A bare number is rejected with a
ValueError: the arcsec-vs-degree ambiguity is a 3600× footgun, so the
unit is required. (The SQL surface, where the argument is literally
named radius_arcsec, is unambiguous and takes a bare number.)
Throughout, the anchor is the left-side catalog — the one you call
.crossmatch(...) on (a in a.crossmatch(b, …)). Its rows drive
the match and its partitions drive the work; other is the right side
searched for counterparts.
The join semantics are two independent axes — how (the join type)
and maxmatch (the match multiplicity) — so every combination is
expressible:
how |
Unmatched anchors |
|---|---|
"inner" |
dropped (the default) |
"left" |
kept; right cols are NULL |
maxmatch |
One row per … |
|---|---|
1 |
anchor — the single nearest match (default) |
-1 |
matching pair within the radius (all matches) |
N >= 2 |
anchor × up to the N nearest within the radius |
So maxmatch=-1 is what you want in crowded fields (multiple
counterparts inside the radius), or whenever you intend to re-rank the
matches yourself — say, on a magnitude-similarity score, not on
nearness. how="left" ("LEFT XMATCH") keeps the population, not just
the matched pairs — "every Gaia source, with its 2MASS counterpart if
any." The two compose: how="left", maxmatch=-1 keeps every
counterpart and the unmatched anchors. (maxmatch=0, or any value
< -1, is a ValueError.)
The radius is inclusive — a counterpart at exactly radius is a
match. For maxmatch >= 2, each anchor's matches are ordered by
separation and ties at equal distance are broken deterministically (by
the right-side row's internal id), so re-running the same query gives
the same N. For the default maxmatch=1, exact-distance ties between
two nearest candidates are extremely rare at float64 precision and the
tie-break is not pinned; if it matters for your science, use
maxmatch=-1 and rank the matches yourself.
dist_col="d_arcsec" injects the great-circle separation, in
arcseconds (float64, nullable on unmatched LEFT rows), as a column
of that name. You can then filter or sort on it like any other column.
suffix="_2mass" overrides the default _<other.alias> suffix used to
disambiguate column-name collisions between the two sides.
nested=True collects each anchor's matches into per-row list<T>
columns — one row per anchor instead of one row per pair — the
light-curve / nested-catalog shape. It is covered under
nested crossmatch below.
The SQL form¶
The SQL equivalent is XMATCH(...) in a JOIN ... ON clause. The SQL
surface is older than the fluent how/maxmatch split: it expresses
the join type through the JOIN keyword (JOIN = inner, LEFT JOIN =
left) and the multiplicity through mode => 'nearest' (default) or
mode => 'all':
FROM a
JOIN b ON XMATCH(radius_arcsec => 1.0) -- nearest, inner
JOIN b ON XMATCH(r => 1.0) -- 'r' is an alias for radius_arcsec
JOIN b ON XMATCH(r => 1.0, mode => 'all') -- every match within r
LEFT JOIN b ON XMATCH(r => 1.0) -- fluent how="left"
JOIN b ON XMATCH(r => 1.0, dist_col => 'd') -- dist_col
Fluent maxmatch ↔ SQL mode
The fluent maxmatch=1 / maxmatch=-1 map to SQL mode =>
'nearest' / mode => 'all'. The fluent "up to N nearest"
(maxmatch=N) has no SQL spelling — use the fluent surface for it.
XMATCH must be the entire ON predicate. Compound predicates like
ON XMATCH(...) AND a.mag < 20 are rejected with ValidationError; put
the extra term in a WHERE clause instead. RIGHT JOIN, FULL JOIN,
and CROSS JOIN are not supported either — swap roles and use a LEFT
join.
What the matcher actually does — astronomy correctness¶
Crossmatching looks easy and isn't. Here are the four behaviors astronomers have learned to check before trusting any new tool. ACID gets two of them right automatically and explicitly hands the other two back to you (with a hard error in one case, a documented limitation in the other).
1. Epoch — all RA/Dec are J2000, no exceptions¶
ACID treats every catalog's stored RA/Dec as J2000 (ICRS). It applies
no epoch propagation, no proper motion, no parallax — positions are used
as-is. There is no epoch / pmra / pmdec / apply_space_motion
keyword anywhere on crossmatch or XMATCH. The matcher converts
each table's stored RA/Dec to 3D unit vectors and matches on chord
distance (a kd-tree search).
The consequence: if you register a catalog whose positions are at a different epoch (e.g. Gaia DR3, which reports J2016.0), high-proper-motion sources will appear to have moved on the sky relative to your J2000 counterpart. At Gaia's J2016, Barnard's Star is 145″ from its J2000 position. A 1″ XMATCH between an as-published Gaia catalog and a J2000 survey will silently miss it.
The fix: propagate your positions to J2000 before registering the catalog with ACID. The easiest way is to propagate the positions with Astropy and open the propagated table directly as a virtual catalog (no HATS import):
import acid
import astropy.units as u
from astropy.coordinates import SkyCoord
from astropy.table import Table
from astropy.time import Time
acid.init("catalogs.yaml", workers=8)
# `gaia` is an Astropy Table of Gaia rows (e.g. from a cone you pulled):
# columns source_id, ra, dec (deg), pmra, pmdec (mas/yr).
moved = SkyCoord(
ra=gaia["ra"] * u.deg, dec=gaia["dec"] * u.deg,
pm_ra_cosdec=gaia["pmra"] * u.mas / u.yr,
pm_dec=gaia["pmdec"] * u.mas / u.yr,
obstime=Time("J2016.0"),
).apply_space_motion(new_obstime=Time("J2000.0"))
# Build a J2000 table and open it directly as a virtual catalog:
gaia_2000 = Table({
"source_id": gaia["source_id"],
"ra_2000": moved.ra.deg,
"dec_2000": moved.dec.deg,
})
targets = acid.open(gaia_2000, ra="ra_2000", dec="dec_2000") # no offline HATS import
matched = targets.crossmatch(acid.open("twomass_psc"), radius=1 * u.arcsec)
See Bring your own target list for the virtual-catalog details.
If proper motions are unavailable, widen the XMATCH radius to absorb the largest expected offset and document the systematic. There is no fully correct workaround on the ACID side; the J2000-only convention is by design.
2. Radius vs. margin cache — rejected at analyze time, with the fix¶
Right-side catalogs in ACID carry a margin cache: a small sibling catalog of "boundary" rows that fall just outside each partition's own HEALPix pixel, written when the catalog is built. Without it, sources within the XMATCH radius of a partition boundary are silently missed (the matcher only sees rows inside the partition).
The rule: ACID rejects any crossmatch whose radius_arcsec
exceeds the right catalog's recorded neighbor_margin_arcsec, at
analyze time — before any data is read:
ValidationError: XMATCH radius_arcsec=5.0 exceeds <name>'s
neighbor_margin_arcsec=1.0; matches near partition boundaries would
be silently missed
hint: rebuild the margin cache at a larger radius, or shrink XMATCH
radius.
The fix — verbatim from the error's own hint:
- Shrink the XMATCH radius below
neighbor_margin_arcsec. If your physics doesn't care about the wider radius, take this path; it needs no I/O. -
Rebuild the margin cache at a larger radius with the CLI:
Then either re-point the registry at the new cache or overwrite the sibling
_margindirectory in place. See margin caches for the full builder reference.
If the right catalog has no margin cache at all the analyzer rejects the query earlier, with a different hint:
ValidationError: XMATCH right table <name> has no neighbor_path (margin
cache) configured
hint: build one with `acid hats build-margin <catalog>`.
Same fix: run acid hats build-margin on the catalog. Catalogs
downloaded with acid download come with a usable margin cache by
default unless --skip-margin was passed.
Why the analyzer rejects this rather than warning
Boundary-crossing matches are a silent failure mode: the query runs, returns rows, and looks right — only the pairs that straddled a partition boundary are missing. A hard rejection at analyze time is the only honest contract.
3. RA wrap and the poles — handled, nothing for you to do¶
A common worry when matching by RA/Dec is the prime meridian and the poles: RA = 359.99° and RA = 0.01° are 0.02° apart on the sky, not 359.98°; and near the poles, equal differences in RA correspond to shrinking arcs.
ACID matches on 3D unit vectors (chord distance), not on
RA/Dec Euclidean distance. RA wrap is automatically correct, the poles
are automatically correct, and you do not need to call
SkyCoord.wrap_at(...) or special-case anything in your query. A
crossmatch across the prime meridian, or a circumpolar match, looks
exactly like any other crossmatch.
4. Output units — arcseconds for dist_col, native units for everything else¶
The dist_col column is great-circle angular separation in
arcseconds (float64), no exceptions; for unmatched LEFT rows it is
NULL.
Every other column passes through unchanged from the source catalog's
on-disk representation. RA / Dec stay degrees ICRS. Magnitudes,
fluxes, IDs, flags, light-curve mjd timestamps are exactly what the
catalog stored — ACID does no unit conversion. If your downstream
code wants degrees of separation or milliarcseconds, do the conversion
explicitly (e.g. SELECT d_arcsec / 3600.0 AS d_deg, or after
.to_pandas()). See results.md for the full discussion.
Bring your own target list¶
A common workflow: "I have a CSV (or FITS table, or a pandas / Astropy table in memory) of target coordinates. Can I crossmatch it against a HATS catalog?"
Yes — directly. acid.open(...) accepts a raw data file or an
in-memory frame as a virtual catalog; you name the RA/Dec columns and
crossmatch it like any registered catalog. No offline HATS import.
import acid
import astropy.units as u
acid.init("catalogs.yaml", workers=8)
# targets.csv has columns RA, DEC (degrees, ICRS) plus whatever else.
targets = acid.open("targets.csv", ra="RA", dec="DEC")
gaia = acid.open("gaia_dr3")
near = targets.crossmatch(gaia, radius=1 * u.arcsec, dist_col="d_arcsec")
tbl = near.to_astropy()
Accepted file types: .parquet, .csv, .tsv, .fits,
.arrow / .feather, and VOTable (.vot / .xml).
import acid
import astropy.units as u
from astropy.table import Table
acid.init("catalogs.yaml", workers=8)
my_targets = Table({"name": [...], "RA": [...], "DEC": [...]}) # ICRS, degrees
targets = acid.open(my_targets, ra="RA", dec="DEC")
gaia = acid.open("gaia_dr3")
near = targets.crossmatch(gaia, radius=1 * u.arcsec).to_astropy()
Accepted frame types: NumPy structured array, pandas, polars,
pyarrow, and Astropy Table.
A virtual catalog is spilled once at open() to a single
memory-mapped Arrow file under the connection's scratch directory, and
cleaned up when the connection closes. A few rules to know:
ra=/dec=are required and never guessed — name the coordinate columns explicitly (degrees, ICRS). A frame with no resolvable coordinates is aValidationError; this on-ramp is spatial.- NULL / NaN-coordinate rows are dropped with a warning, so a stray blank row in your CSV won't silently produce bad matches.
- The virtual catalog can be the crossmatch root (as above) or an
operand (
gaia.crossmatch(acid.open("targets.csv", ra=..., dec=...), ...)), INNER or LEFT. - You still need a margin cache on the right-side catalog (the one
you're matching into) at a radius ≥ your match radius — both
acid downloadandacid hats build-marginproduce one. The virtual target catalog on the left does not need a margin cache.
Your target list only reads the survey where it overlaps
A virtual catalog is partitioned by sky position, so the crossmatch only touches the survey partitions your targets actually fall in — a few hundred targets scattered across the sky don't make you scan the whole right catalog.
The SQL escape hatch needs a registered name
acid.open(<file>) returns a fluent Catalog but does not put
a name in the registry, so acid.sql.query("... FROM <file> ...") can't
see it (an ad-hoc file can't shadow a configured catalog). To use a
raw file by name in SQL, register it with
acid.register_file(name, path, ra=..., dec=...) or, on
the CLI, acid query --open — see
Writing queries.
Worked examples¶
Nearest match within 1″¶
The default — one row per anchor source, the closest right-side source inside the radius, anchors with no match are dropped:
Every match within 5″ (crowded fields)¶
One row per pair within the radius:
Keep every anchor, matched or not (LEFT XMATCH)¶
how="left" keeps every anchor row and fills the right-side columns
with NULL where there is no match. Combine with a WHERE on the
distance column for an anti-join ("everything in a without a
counterpart in b"):
See filtering rows for the full IS NULL /
IS NOT NULL pattern.
Top-K closest matches¶
Pair .sort(...) + .limit(K) (fluent) or ORDER BY ... LIMIT K
(SQL). Both push the LIMIT to each partition, so the cost scales with
K, not with total matches:
Restrict to a survey footprint, then crossmatch¶
Catalog.in_region(<name>) restricts the catalog to a MOC footprint
before the matcher runs — typically another survey's footprint, given
as the name of a registered catalog (its point_map.fits is used).
Compose it with .crossmatch(...) and a magnitude cut to get e.g.
"every Gaia source inside DELVE's footprint, in the 14–18 mag range,
paired with its closest 2MASS counterpart within 1″":
The footprint restriction prunes whole partitions before any matching
work — see Sky regions & footprints for the
mechanics, and Filtering rows for how the magnitude
cut is pushed down via _healpix_29 row-group statistics.
Chain a second join after the crossmatch¶
The output of a crossmatch is itself a Catalog; you can chain
another XMATCH, an ordinary integer-ID join, or a fluent filter onto
it. The most common shape is "match by position, then attach light
curves by ID":
See Light curves for a list of targets for the recipe-shaped version of this pattern.
Nested crossmatch — one row per anchor¶
nested=True folds each anchor's matches into per-row list<T>
columns: one row per anchor, the anchor's own columns staying scalar,
every matched-side column becoming a list. This is the shape you want
when an anchor can have several counterparts and you want them grouped
by anchor rather than exploded into one row per pair — and it avoids
materializing the exploded pair table, which is what OOMs at survey
scale.
import acid
import astropy.units as u
acid.init("catalogs.yaml", workers=8)
# Each Gaia source, with the list of every 2MASS source within 2″,
# nearest-first by separation:
nested = (acid.open("gaia_dr3")
.crossmatch(acid.open("twomass_psc"),
radius=2 * u.arcsec, maxmatch=-1, dist_col="d",
nested=True, order_by="d"))
order_by= sorts the elements within each list consistently across
all list columns — element i of designation is the same 2MASS
source as element i of d. With how="left", an anchor with no
match gets empty lists ([], length 0), not a phantom [null].
There's no SQL spelling for nested crossmatch; it's fluent-only. See
Light curves for the equi-join version
(object ⋈ source) and Aggregating
for the single-catalog list fold.
Performance notes¶
A few things help crossmatches run fast on real catalogs.
- Tighten the radius. A 1″ match scans less margin and produces far fewer candidate pairs than a 10″ match. If you only need 1″, ask for 1″.
- Project early. A
.select(...)(fluent) or narrowSELECTlist (SQL) is pushed all the way down to parquet —acidonly reads the columns you actually need. Wide catalogs (150+ columns) do not slow a narrow query down. - Filter early.
.where("phot_g_mean_mag < 16")before.crossmatch(...)is a pre-filter: the matcher only sees passing rows. To pre-filter the right side, filter the operand (a.crossmatch(b.where("mag < 20"), radius=1 * u.arcsec)) — withmaxmatch=1this gives "nearest surviving match" semantics. - Anchor choice matters. The anchor (the catalog
.crossmatch(...)is called on) drives the work-tuple enumeration. If your right catalog is much larger and you only need rows in the smaller, put the smaller catalog as the anchor.
See Performance & parallelism for the cluster-side
levers (workers, threads, allocator tuning).
See also¶
- Working with results & exporting — what comes out of a
crossmatch, in what units, and how to write it to disk. - Filtering rows — the
where/WHEREcuts you apply before and after a crossmatch. - Aggregating — counting matched rows per anchor, per band, per pixel.
- Margin caches — what they are and how to (re)build them at the radius your crossmatch needs.
- Sky regions & footprints — restricting a crossmatch to a survey footprint.
- Light curves for a list of targets — the recipe shape for "Gaia × Rubin object table × Rubin lightcurves".