Skip to content

Crossmatching catalogs

You have two sky-survey catalogs and want to associate sources by sky position — every Gaia source within 1″ of a 2MASS source, every Rubin object inside 5″ of a known artifact, that kind of thing. This is the single most common thing astronomers reach for acid to do, and it is the first page of this guide for a reason.

This page covers:

  • How to write a crossmatch on either of the two surfaces — fluent (Catalog.crossmatch(...)) and SQL (XMATCH(...) in ON).
  • The four trust-builders the matcher does — and does not — handle: the J2000-only epoch convention, the radius-vs-margin-cache rule (and how to fix it), RA-wrap / poles, output units.
  • How to crossmatch a target list you bring yourself — a CSV, parquet, FITS, or an in-memory pandas/Astropy table — without an offline HATS import.

If you already know all of that and just want a recipe, jump to the examples at the bottom.

The shortest crossmatch

The two surfaces produce the same query; use whichever suits the rest of your code. Pick the test fixtures a and b (b is a shifted by 0.3″ in RA — the synthetic catalogs that come with the test suite); for real data, substitute gaia_dr3 and twomass_psc or whatever you have registered.

crossmatch.py
import acid
import astropy.units as u

acid.init("catalogs.yaml", workers=8)
a = acid.open("a")
b = acid.open("b")
matched = a.crossmatch(b, radius=1 * u.arcsec, dist_col="d_arcsec")

df = matched.to_astropy()          # one call: execute + convert
crossmatch_sql.py
import acid

acid.init("catalogs.yaml", workers=8)
r = acid.sql.query("""
    SELECT a.*, b.*, d_arcsec
    FROM   a
    JOIN   b ON XMATCH(radius_arcsec => 1.0, dist_col => 'd_arcsec')
""")
df = r.to_polars()

In both cases the result has one row per matched pair, with every column from a and every column from b (right-side columns that clash with the left get a _<alias> suffix — id and id_b, ra and ra_b). The extra d_arcsec column carries the great-circle separation in arcseconds (see Output units).

Fluent vs. SQL — pick one

The fluent surface works with flat column names like id and id_b (the merged columns you see in the output), not a.id / b.id. The SQL surface still understands alias.col. Most queries on this page are shown in both; pick whichever shape your downstream code already uses. See the (future) "Verbs or SQL — which should I use?" guide for a decision aid.

crossmatch(...) — the verb in detail

Catalog.crossmatch takes one positional argument — the right-side catalog — and keyword arguments:

left.crossmatch(
    other,                           # a Catalog from acid.open(...)
    *,
    radius,                          # an astropy Quantity (1 * u.arcsec); bare float rejected
    how="inner",                     # "inner" | "left"  — the join type
    maxmatch=1,                      # 1 | -1 (all) | N>=2 — the match multiplicity
    dist_col=None,                   # "d_arcsec" surfaces the separation column
    suffix=None,                     # override the "_<alias>" collision suffix
    nested=False,                    # fold matches into per-row lists (see below)
    order_by=None,                   # element order inside the nested lists
)

radius must be an astropy.units.Quantity with an angular unit (1 * u.arcsec, 0.5 * u.deg). A bare number is rejected with a ValueError: the arcsec-vs-degree ambiguity is a 3600× footgun, so the unit is required. (The SQL surface, where the argument is literally named radius_arcsec, is unambiguous and takes a bare number.)

Throughout, the anchor is the left-side catalog — the one you call .crossmatch(...) on (a in a.crossmatch(b, …)). Its rows drive the match and its partitions drive the work; other is the right side searched for counterparts.

The join semantics are two independent axeshow (the join type) and maxmatch (the match multiplicity) — so every combination is expressible:

how Unmatched anchors
"inner" dropped (the default)
"left" kept; right cols are NULL
maxmatch One row per …
1 anchor — the single nearest match (default)
-1 matching pair within the radius (all matches)
N >= 2 anchor × up to the N nearest within the radius

So maxmatch=-1 is what you want in crowded fields (multiple counterparts inside the radius), or whenever you intend to re-rank the matches yourself — say, on a magnitude-similarity score, not on nearness. how="left" ("LEFT XMATCH") keeps the population, not just the matched pairs — "every Gaia source, with its 2MASS counterpart if any." The two compose: how="left", maxmatch=-1 keeps every counterpart and the unmatched anchors. (maxmatch=0, or any value < -1, is a ValueError.)

The radius is inclusive — a counterpart at exactly radius is a match. For maxmatch >= 2, each anchor's matches are ordered by separation and ties at equal distance are broken deterministically (by the right-side row's internal id), so re-running the same query gives the same N. For the default maxmatch=1, exact-distance ties between two nearest candidates are extremely rare at float64 precision and the tie-break is not pinned; if it matters for your science, use maxmatch=-1 and rank the matches yourself.

dist_col="d_arcsec" injects the great-circle separation, in arcseconds (float64, nullable on unmatched LEFT rows), as a column of that name. You can then filter or sort on it like any other column.

suffix="_2mass" overrides the default _<other.alias> suffix used to disambiguate column-name collisions between the two sides.

nested=True collects each anchor's matches into per-row list<T> columns — one row per anchor instead of one row per pair — the light-curve / nested-catalog shape. It is covered under nested crossmatch below.

The SQL form

The SQL equivalent is XMATCH(...) in a JOIN ... ON clause. The SQL surface is older than the fluent how/maxmatch split: it expresses the join type through the JOIN keyword (JOIN = inner, LEFT JOIN = left) and the multiplicity through mode => 'nearest' (default) or mode => 'all':

FROM      a
JOIN      b   ON XMATCH(radius_arcsec => 1.0)                       -- nearest, inner
JOIN      b   ON XMATCH(r => 1.0)                                   -- 'r' is an alias for radius_arcsec
JOIN      b   ON XMATCH(r => 1.0, mode => 'all')                    -- every match within r
LEFT JOIN b   ON XMATCH(r => 1.0)                                   -- fluent how="left"
JOIN      b   ON XMATCH(r => 1.0, dist_col => 'd')                  -- dist_col

Fluent maxmatch ↔ SQL mode

The fluent maxmatch=1 / maxmatch=-1 map to SQL mode => 'nearest' / mode => 'all'. The fluent "up to N nearest" (maxmatch=N) has no SQL spelling — use the fluent surface for it.

XMATCH must be the entire ON predicate. Compound predicates like ON XMATCH(...) AND a.mag < 20 are rejected with ValidationError; put the extra term in a WHERE clause instead. RIGHT JOIN, FULL JOIN, and CROSS JOIN are not supported either — swap roles and use a LEFT join.

What the matcher actually does — astronomy correctness

Crossmatching looks easy and isn't. Here are the four behaviors astronomers have learned to check before trusting any new tool. ACID gets two of them right automatically and explicitly hands the other two back to you (with a hard error in one case, a documented limitation in the other).

1. Epoch — all RA/Dec are J2000, no exceptions

ACID treats every catalog's stored RA/Dec as J2000 (ICRS). It applies no epoch propagation, no proper motion, no parallax — positions are used as-is. There is no epoch / pmra / pmdec / apply_space_motion keyword anywhere on crossmatch or XMATCH. The matcher converts each table's stored RA/Dec to 3D unit vectors and matches on chord distance (a kd-tree search).

The consequence: if you register a catalog whose positions are at a different epoch (e.g. Gaia DR3, which reports J2016.0), high-proper-motion sources will appear to have moved on the sky relative to your J2000 counterpart. At Gaia's J2016, Barnard's Star is 145″ from its J2000 position. A 1″ XMATCH between an as-published Gaia catalog and a J2000 survey will silently miss it.

The fix: propagate your positions to J2000 before registering the catalog with ACID. The easiest way is to propagate the positions with Astropy and open the propagated table directly as a virtual catalog (no HATS import):

propagate_to_J2000.py
import acid
import astropy.units as u
from astropy.coordinates import SkyCoord
from astropy.table import Table
from astropy.time import Time

acid.init("catalogs.yaml", workers=8)

# `gaia` is an Astropy Table of Gaia rows (e.g. from a cone you pulled):
# columns source_id, ra, dec (deg), pmra, pmdec (mas/yr).
moved = SkyCoord(
    ra=gaia["ra"] * u.deg, dec=gaia["dec"] * u.deg,
    pm_ra_cosdec=gaia["pmra"] * u.mas / u.yr,
    pm_dec=gaia["pmdec"]      * u.mas / u.yr,
    obstime=Time("J2016.0"),
).apply_space_motion(new_obstime=Time("J2000.0"))

# Build a J2000 table and open it directly as a virtual catalog:
gaia_2000 = Table({
    "source_id": gaia["source_id"],
    "ra_2000":  moved.ra.deg,
    "dec_2000": moved.dec.deg,
})
targets = acid.open(gaia_2000, ra="ra_2000", dec="dec_2000")  # no offline HATS import

matched = targets.crossmatch(acid.open("twomass_psc"), radius=1 * u.arcsec)

See Bring your own target list for the virtual-catalog details.

If proper motions are unavailable, widen the XMATCH radius to absorb the largest expected offset and document the systematic. There is no fully correct workaround on the ACID side; the J2000-only convention is by design.

2. Radius vs. margin cache — rejected at analyze time, with the fix

Right-side catalogs in ACID carry a margin cache: a small sibling catalog of "boundary" rows that fall just outside each partition's own HEALPix pixel, written when the catalog is built. Without it, sources within the XMATCH radius of a partition boundary are silently missed (the matcher only sees rows inside the partition).

The rule: ACID rejects any crossmatch whose radius_arcsec exceeds the right catalog's recorded neighbor_margin_arcsec, at analyze time — before any data is read:

ValidationError: XMATCH radius_arcsec=5.0 exceeds <name>'s
neighbor_margin_arcsec=1.0; matches near partition boundaries would
be silently missed
hint: rebuild the margin cache at a larger radius, or shrink XMATCH
radius.

The fix — verbatim from the error's own hint:

  1. Shrink the XMATCH radius below neighbor_margin_arcsec. If your physics doesn't care about the wider radius, take this path; it needs no I/O.
  2. Rebuild the margin cache at a larger radius with the CLI:

    acid hats build-margin /data/<catalog> --margin-arcsec 10.0
    

    Then either re-point the registry at the new cache or overwrite the sibling _margin directory in place. See margin caches for the full builder reference.

If the right catalog has no margin cache at all the analyzer rejects the query earlier, with a different hint:

ValidationError: XMATCH right table <name> has no neighbor_path (margin
cache) configured
hint: build one with `acid hats build-margin <catalog>`.

Same fix: run acid hats build-margin on the catalog. Catalogs downloaded with acid download come with a usable margin cache by default unless --skip-margin was passed.

Why the analyzer rejects this rather than warning

Boundary-crossing matches are a silent failure mode: the query runs, returns rows, and looks right — only the pairs that straddled a partition boundary are missing. A hard rejection at analyze time is the only honest contract.

3. RA wrap and the poles — handled, nothing for you to do

A common worry when matching by RA/Dec is the prime meridian and the poles: RA = 359.99° and RA = 0.01° are 0.02° apart on the sky, not 359.98°; and near the poles, equal differences in RA correspond to shrinking arcs.

ACID matches on 3D unit vectors (chord distance), not on RA/Dec Euclidean distance. RA wrap is automatically correct, the poles are automatically correct, and you do not need to call SkyCoord.wrap_at(...) or special-case anything in your query. A crossmatch across the prime meridian, or a circumpolar match, looks exactly like any other crossmatch.

4. Output units — arcseconds for dist_col, native units for everything else

The dist_col column is great-circle angular separation in arcseconds (float64), no exceptions; for unmatched LEFT rows it is NULL.

Every other column passes through unchanged from the source catalog's on-disk representation. RA / Dec stay degrees ICRS. Magnitudes, fluxes, IDs, flags, light-curve mjd timestamps are exactly what the catalog stored — ACID does no unit conversion. If your downstream code wants degrees of separation or milliarcseconds, do the conversion explicitly (e.g. SELECT d_arcsec / 3600.0 AS d_deg, or after .to_pandas()). See results.md for the full discussion.

Bring your own target list

A common workflow: "I have a CSV (or FITS table, or a pandas / Astropy table in memory) of target coordinates. Can I crossmatch it against a HATS catalog?"

Yes — directly. acid.open(...) accepts a raw data file or an in-memory frame as a virtual catalog; you name the RA/Dec columns and crossmatch it like any registered catalog. No offline HATS import.

byo_targets_file.py
import acid
import astropy.units as u

acid.init("catalogs.yaml", workers=8)

# targets.csv has columns RA, DEC (degrees, ICRS) plus whatever else.
targets = acid.open("targets.csv", ra="RA", dec="DEC")
gaia    = acid.open("gaia_dr3")

near = targets.crossmatch(gaia, radius=1 * u.arcsec, dist_col="d_arcsec")
tbl  = near.to_astropy()

Accepted file types: .parquet, .csv, .tsv, .fits, .arrow / .feather, and VOTable (.vot / .xml).

byo_targets_frame.py
import acid
import astropy.units as u
from astropy.table import Table

acid.init("catalogs.yaml", workers=8)

my_targets = Table({"name": [...], "RA": [...], "DEC": [...]})  # ICRS, degrees

targets = acid.open(my_targets, ra="RA", dec="DEC")
gaia    = acid.open("gaia_dr3")

near = targets.crossmatch(gaia, radius=1 * u.arcsec).to_astropy()

Accepted frame types: NumPy structured array, pandas, polars, pyarrow, and Astropy Table.

A virtual catalog is spilled once at open() to a single memory-mapped Arrow file under the connection's scratch directory, and cleaned up when the connection closes. A few rules to know:

  • ra= / dec= are required and never guessed — name the coordinate columns explicitly (degrees, ICRS). A frame with no resolvable coordinates is a ValidationError; this on-ramp is spatial.
  • NULL / NaN-coordinate rows are dropped with a warning, so a stray blank row in your CSV won't silently produce bad matches.
  • The virtual catalog can be the crossmatch root (as above) or an operand (gaia.crossmatch(acid.open("targets.csv", ra=..., dec=...), ...)), INNER or LEFT.
  • You still need a margin cache on the right-side catalog (the one you're matching into) at a radius ≥ your match radius — both acid download and acid hats build-margin produce one. The virtual target catalog on the left does not need a margin cache.

Your target list only reads the survey where it overlaps

A virtual catalog is partitioned by sky position, so the crossmatch only touches the survey partitions your targets actually fall in — a few hundred targets scattered across the sky don't make you scan the whole right catalog.

The SQL escape hatch needs a registered name

acid.open(<file>) returns a fluent Catalog but does not put a name in the registry, so acid.sql.query("... FROM <file> ...") can't see it (an ad-hoc file can't shadow a configured catalog). To use a raw file by name in SQL, register it with acid.register_file(name, path, ra=..., dec=...) or, on the CLI, acid query --open — see Writing queries.

Worked examples

Nearest match within 1″

The default — one row per anchor source, the closest right-side source inside the radius, anchors with no match are dropped:

near = a.crossmatch(b, radius=1 * u.arcsec, dist_col="d")
df = near.to_polars()
SELECT a.id, b.id AS b_id, d
FROM   a
JOIN   b ON XMATCH(radius_arcsec => 1.0, dist_col => 'd')

Every match within 5″ (crowded fields)

One row per pair within the radius:

every = a.crossmatch(b, radius=5 * u.arcsec, maxmatch=-1, dist_col="d")
SELECT a.id, b.id AS b_id, d
FROM   a
JOIN   b ON XMATCH(radius_arcsec => 5.0, mode => 'all', dist_col => 'd')

Keep every anchor, matched or not (LEFT XMATCH)

how="left" keeps every anchor row and fills the right-side columns with NULL where there is no match. Combine with a WHERE on the distance column for an anti-join ("everything in a without a counterpart in b"):

anti = (a
        .crossmatch(b, radius=1 * u.arcsec, how="left", dist_col="d")
        .where("d IS NULL"))
# how="left" keeps unmatched anchors; the WHERE keeps only those.
SELECT a.*
FROM      a
LEFT JOIN b ON XMATCH(radius_arcsec => 1.0, dist_col => 'd')
WHERE     d IS NULL

See filtering rows for the full IS NULL / IS NOT NULL pattern.

Top-K closest matches

Pair .sort(...) + .limit(K) (fluent) or ORDER BY ... LIMIT K (SQL). Both push the LIMIT to each partition, so the cost scales with K, not with total matches:

closest = (a
           .crossmatch(b, radius=5 * u.arcsec, dist_col="d")
           .sort("d")
           .limit(100))
SELECT a.id, b.id AS b_id, d
FROM   a
JOIN   b ON XMATCH(radius_arcsec => 5.0, dist_col => 'd')
ORDER BY d ASC
LIMIT 100

Restrict to a survey footprint, then crossmatch

Catalog.in_region(<name>) restricts the catalog to a MOC footprint before the matcher runs — typically another survey's footprint, given as the name of a registered catalog (its point_map.fits is used). Compose it with .crossmatch(...) and a magnitude cut to get e.g. "every Gaia source inside DELVE's footprint, in the 14–18 mag range, paired with its closest 2MASS counterpart within 1″":

gaia    = acid.open("gaia_dr3")
twomass = acid.open("twomass_psc")

inside = (gaia.in_region("delve")
          .crossmatch(twomass, radius=1 * u.arcsec, dist_col="d")
          .where("phot_g_mean_mag BETWEEN 14 AND 18")
          .to_polars())
SELECT g.source_id, t.designation, d
FROM   gaia_dr3      AS g
JOIN   twomass_psc   AS t ON XMATCH(radius_arcsec => 1.0, dist_col => 'd')
WHERE  IN_MOC(g, 'delve')
   AND g.phot_g_mean_mag BETWEEN 14 AND 18

The footprint restriction prunes whole partitions before any matching work — see Sky regions & footprints for the mechanics, and Filtering rows for how the magnitude cut is pushed down via _healpix_29 row-group statistics.

Chain a second join after the crossmatch

The output of a crossmatch is itself a Catalog; you can chain another XMATCH, an ordinary integer-ID join, or a fluent filter onto it. The most common shape is "match by position, then attach light curves by ID":

object_lc = (acid.open("a")
               .crossmatch(acid.open("b"), radius=1 * u.arcsec, dist_col="d")
               .join(acid.open("lc"), on=("id", "object_id")))
SELECT a.id, b.id AS b_id, d, lc.time, lc.mag
FROM   a
JOIN   b  ON XMATCH(radius_arcsec => 1.0, dist_col => 'd')
JOIN   lc ON a.id = lc.object_id

See Light curves for a list of targets for the recipe-shaped version of this pattern.

Nested crossmatch — one row per anchor

nested=True folds each anchor's matches into per-row list<T> columns: one row per anchor, the anchor's own columns staying scalar, every matched-side column becoming a list. This is the shape you want when an anchor can have several counterparts and you want them grouped by anchor rather than exploded into one row per pair — and it avoids materializing the exploded pair table, which is what OOMs at survey scale.

import acid
import astropy.units as u

acid.init("catalogs.yaml", workers=8)

# Each Gaia source, with the list of every 2MASS source within 2″,
# nearest-first by separation:
nested = (acid.open("gaia_dr3")
          .crossmatch(acid.open("twomass_psc"),
                      radius=2 * u.arcsec, maxmatch=-1, dist_col="d",
                      nested=True, order_by="d"))

order_by= sorts the elements within each list consistently across all list columns — element i of designation is the same 2MASS source as element i of d. With how="left", an anchor with no match gets empty lists ([], length 0), not a phantom [null]. There's no SQL spelling for nested crossmatch; it's fluent-only. See Light curves for the equi-join version (object ⋈ source) and Aggregating for the single-catalog list fold.

Performance notes

A few things help crossmatches run fast on real catalogs.

  • Tighten the radius. A 1″ match scans less margin and produces far fewer candidate pairs than a 10″ match. If you only need 1″, ask for 1″.
  • Project early. A .select(...) (fluent) or narrow SELECT list (SQL) is pushed all the way down to parquet — acid only reads the columns you actually need. Wide catalogs (150+ columns) do not slow a narrow query down.
  • Filter early. .where("phot_g_mean_mag < 16") before .crossmatch(...) is a pre-filter: the matcher only sees passing rows. To pre-filter the right side, filter the operand (a.crossmatch(b.where("mag < 20"), radius=1 * u.arcsec)) — with maxmatch=1 this gives "nearest surviving match" semantics.
  • Anchor choice matters. The anchor (the catalog .crossmatch(...) is called on) drives the work-tuple enumeration. If your right catalog is much larger and you only need rows in the smaller, put the smaller catalog as the anchor.

See Performance & parallelism for the cluster-side levers (workers, threads, allocator tuning).

See also