Concepts

This page is the mental model behind ACID. Read it once and the rest of the documentation will make sense. You don't need to memorise every term — the Glossary has the definitions for later reference.

The big picture

A modern sky-survey catalog is a huge table of sources — typically hundreds of millions to billions of rows, each one a star, galaxy, or transient detection. Loading the whole thing into memory is hopeless; even reading it sequentially from disk takes hours.

The trick that makes catalogs of this size tractable is partitioning the sky into tiles. The catalog is split into thousands of files, each one holding the sources in one sky tile. A query that only touches a small patch of sky reads only the partitions that overlap that patch. A crossmatch between two catalogs only has to compare partitions that overlap on the sky. Because each partition can be processed independently, the work parallelises naturally.

acid is the engine that does this for you. You write SQL that talks about catalogs as if they were ordinary tables; acid walks the partitions, runs each one in parallel, and stitches the results back together.

The rest of this page explains the words you'll see throughout the docs: HEALPix, HATS, crossmatch, margin cache, MOC.

HEALPix in one paragraph

HEALPix is a scheme for tiling the sphere with equal-area pixels. It is parameterised by an integer Norder controlling how fine the tiling is: at order 0 the sky is divided into 12 pixels; each step up in order divides every pixel into 4. At order 5 the sphere has 12 288 pixels (each about 1.8° across, or roughly 3.4 deg² in area); at order 10 it has ~12.6 million (each about 3.4 arcmin across); at order 29 the pixels are about 0.4 milliarcseconds across, fine enough to address individual source positions. A pixel within a given order is identified by an integer Npix. Together (Norder, Npix) names one sky tile uniquely.

HEALPix subdivision: each step in order divides every pixel into four.

You don't do HEALPix arithmetic yourself — acid handles that internally. What matters is that a catalog partition corresponds to one (Norder, Npix) pixel, and that finer orders chop the sky into smaller pieces.
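If you are curious where the pixel counts and sizes come from, the arithmetic is short. The sketch below is plain Python and not part of acid; the function names are made up for illustration. It assumes the NESTED numbering that HATS uses, in which the four children of a pixel at the next order are 4 × Npix through 4 × Npix + 3, and it approximates a pixel's width as the square root of its area.

```python
import math

def healpix_npix(order: int) -> int:
    """Number of pixels tiling the whole sky at a given HEALPix order."""
    return 12 * 4 ** order

def healpix_pixel_size_deg(order: int) -> float:
    """Approximate pixel width in degrees, taken as sqrt(pixel area)."""
    full_sky_sq_deg = 4 * math.pi * (180 / math.pi) ** 2  # about 41 253 deg^2
    return math.sqrt(full_sky_sq_deg / healpix_npix(order))

def parent(npix: int) -> int:
    """Pixel one order coarser that contains npix (NESTED numbering)."""
    return npix >> 2

def children(npix: int) -> list[int]:
    """The four pixels one order finer that npix splits into (NESTED numbering)."""
    return [4 * npix + k for k in range(4)]

if __name__ == "__main__":
    for order in (0, 5, 10, 29):
        size_deg = healpix_pixel_size_deg(order)
        print(f"order {order:2d}: {healpix_npix(order):>22,d} pixels, "
              f"{size_deg:.3e} deg = {size_deg * 3600:.3e} arcsec across")
```

The parent/child relation is what makes the tiling hierarchical: shifting Npix right by two bits moves one order coarser, which is the only operation needed to tell whether one tile contains another.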

What does it look like on the sphere?

For figures of the HEALPix tiling laid out on an actual sphere (the canonical "12 base pixels wrap the globe" picture), see the HEALPix project page and its linked documentation.

HATS — how catalogs sit on disk

HATS (the Hierarchical Adaptive Tiling Scheme) is the on-disk layout read and written by tools such as LSDB and hats-import, and used for HATS releases of Gaia DR3, Rubin DP1, DES DR2, 2MASS, DELVE, SkyMapper, ZTF, and many other modern sky surveys. A HATS catalog is a directory laid out (simplified) like this:

gaia_dr3/
├── properties              metadata: column names, healpix order, …
├── partition_info.csv      which (Norder, Npix) partitions exist
├── point_map.fits          sky-density footprint (used by IN_MOC)
└── dataset/
    ├── Norder=5/Dir=0/Npix=42.parquet
    ├── Norder=5/Dir=0/Npix=43.parquet
    ├── Norder=5/Dir=0/Npix=44.parquet
    └── ...

Each file is an Apache Parquet file — a columnar binary format that lets acid read only the columns you ask for and skip large blocks based on filter values. Each file holds the rows for one HEALPix pixel. acid auto-discovers every catalog in a directory and figures out the schema, the partitioning, and which columns hold RA / Dec / healpix ID from properties. You almost never need to write a config file.
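To make the file naming in the listing above concrete, here is a small sketch that maps a (Norder, Npix) pair to its relative path under dataset/ and back. The helper names are hypothetical and this is not acid's code. The rule used for Dir (pixel files grouped in blocks of 10 000, so Dir is Npix rounded down to a multiple of 10 000) is an assumption based on the usual HATS convention; check it against your catalog before relying on it.

```python
import re

# Assumed HATS convention: files are grouped into Dir=<block> folders of
# 10 000 pixels each, so that no single directory grows too large.
DIR_BLOCK = 10_000

_PATH_RE = re.compile(r"Norder=(\d+)/Dir=(\d+)/Npix=(\d+)\.parquet")

def partition_path(norder: int, npix: int) -> str:
    """Relative path (under dataset/) of the file for one sky tile."""
    directory = (npix // DIR_BLOCK) * DIR_BLOCK
    return f"Norder={norder}/Dir={directory}/Npix={npix}.parquet"

def parse_partition_path(path: str) -> tuple[int, int]:
    """Recover (norder, npix) from a relative partition path."""
    match = _PATH_RE.fullmatch(path)
    if match is None:
        raise ValueError(f"not a partition path: {path!r}")
    norder, _directory, npix = (int(g) for g in match.groups())
    return norder, npix

if __name__ == "__main__":
    print(partition_path(5, 42))
    print(parse_partition_path("Norder=5/Dir=0/Npix=43.parquet"))
```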

In practice, almost all real HATS catalogs use adaptive Norder: dense regions of sky (the Galactic plane, the Magellanic Clouds) are partitioned at a finer order than sparse ones (high Galactic latitudes), so each partition holds roughly the same number of rows and costs roughly the same to process. acid handles this automatically; queries look the same whether the catalog is uniform-Norder or adaptive.
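One way to picture adaptive partitioning is as a top-down split: start from the 12 base pixels and keep dividing any pixel that holds more rows than a threshold. The sketch below does exactly that from a table of row counts at some fine order. It is an illustration only; the function name, the threshold, and the top-down strategy are assumptions, and real importers such as hats-import may build the tiling differently.

```python
def adaptive_tiling(counts, count_order, threshold):
    """Choose tiles so that each holds at most `threshold` rows where possible.

    counts      -- dict: NESTED pixel index at `count_order` -> number of rows
    count_order -- the (finest) order at which rows were counted
    threshold   -- split any tile holding more rows than this
    Returns a sorted list of (norder, npix, n_rows); empty tiles are omitted.
    """
    def rows_in(order, npix):
        # In NESTED numbering the descendants of a pixel form one
        # contiguous index range at any finer order.
        shift = 2 * (count_order - order)
        lo, hi = npix << shift, (npix + 1) << shift
        return sum(c for p, c in counts.items() if lo <= p < hi)

    tiles = []
    stack = [(0, p) for p in range(12)]  # the 12 base pixels
    while stack:
        order, npix = stack.pop()
        n = rows_in(order, npix)
        if n == 0:
            continue  # nothing here, so no file is written
        if n > threshold and order < count_order:
            stack.extend((order + 1, 4 * npix + k) for k in range(4))
        else:
            tiles.append((order, npix, n))
    return sorted(tiles)

if __name__ == "__main__":
    # Toy counts at order 2: a dense corner of base pixel 0, a sparse base pixel 6.
    counts = {0: 60, 1: 30, 2: 20, 3: 10, 4: 50, 100: 40}
    for tile in adaptive_tiling(counts, count_order=2, threshold=100):
        print(tile)
```

In the toy input the dense region ends up at order 2 while the sparse pixel stays at order 0, which is the same pattern as the Galactic plane versus high latitudes.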

Sphere-and-partition diagram coming

A figure showing how an adaptive HATS tiling lays out on the sphere (coarser tiles towards the Galactic poles, finer tiles in the Galactic plane) will go here. For now, the LSDB documentation has good examples of HATS catalogs visualised in this way; see the figures under docs.lsdb.io.

Crossmatching catalogs

A crossmatch is the astronomer's term for "find pairs of sources from two catalogs whose sky positions agree to within some angular distance". It is the bread-and-butter operation when comparing surveys — which Gaia stars have a 2MASS counterpart? which Rubin detections match a known DES galaxy?

Without partitioning, a naïve crossmatch is O(N × M): every source in catalog A is compared to every source in catalog B. For two catalogs of 10⁹ sources each, that is 10¹⁸ comparisons, which is out of reach. With partitioning the work is local: sources in pixel P of catalog A only need to be compared with sources in pixel P (and its immediate neighbours) of catalog B. The total work drops by orders of magnitude.
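A back-of-envelope estimate shows how large the saving is. The sketch below assumes sources are spread uniformly over equal-sized partitions, that each partition is compared against itself plus 8 neighbours, and that comparisons inside a partition are still brute force. All three are simplifying assumptions chosen for illustration (a real engine reads only a thin margin from the neighbours and typically uses a spatial index), so treat the result as a pessimistic bound rather than a description of acid.

```python
def naive_comparisons(n_a: int, n_b: int) -> float:
    """Every source in A against every source in B."""
    return n_a * n_b

def partitioned_comparisons(n_a: int, n_b: int, n_partitions: int,
                            neighbours: int = 8) -> float:
    """Each partition of A against the same partition of B plus its neighbours.

    Assumes uniform density, so every partition holds n / n_partitions rows.
    """
    per_a = n_a / n_partitions
    per_b = n_b / n_partitions
    return n_partitions * per_a * per_b * (1 + neighbours)

if __name__ == "__main__":
    n = 10 ** 9
    partitions = 12 * 4 ** 5  # a uniform order-5 tiling
    naive = naive_comparisons(n, n)
    local = partitioned_comparisons(n, n, partitions)
    print(f"naive:       {naive:.2e}")
    print(f"partitioned: {local:.2e}")
    print(f"reduction:   {naive / local:.0f}x")
```

Under these assumptions the reduction factor is simply the number of partitions divided by 9, so finer tilings help until the per-partition overhead starts to dominate.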

acid does this with one SQL extension:

SELECT a.source_id, b.designation
FROM   gaia    AS a
JOIN   two_mass AS b ON XMATCH(radius_arcsec => 1.0)

XMATCH(radius_arcsec => 1.0) is the only new syntax. It tells acid: "match a to b by sky position, with a 1-arcsec radius." Two modes are available:

  • mode => 'nearest' (default) — return the single closest match within the radius (if any).
  • mode => 'all' — return every match within the radius.

The match distance for any matched pair is available as XMATCH_DISTANCE(b), in arcseconds. You can put it in SELECT, WHERE, or ORDER BY.
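To pin down what the two modes mean, here is a brute-force reference written in plain Python. It is only a model of the semantics described above, not how acid computes matches, and the function names are invented for the example. It assumes positions in degrees, uses the haversine formula for angular separation, treats the radius as inclusive, and drops unmatched rows from the left catalog as an inner join would; the last two points are assumptions.

```python
import math

def separation_arcsec(ra1, dec1, ra2, dec2):
    """Angular separation in arcseconds (haversine formula, inputs in degrees)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600

def xmatch(left, right, radius_arcsec, mode="nearest"):
    """Brute-force crossmatch of two lists of (id, ra_deg, dec_deg) tuples.

    Returns (left_id, right_id, distance_arcsec) tuples. With mode="nearest"
    each left row yields at most one match; with mode="all" it yields every
    right row within the radius, closest first.
    """
    if mode not in ("nearest", "all"):
        raise ValueError(f"unknown mode: {mode!r}")
    matches = []
    for lid, lra, ldec in left:
        hits = []
        for rid, rra, rdec in right:
            dist = separation_arcsec(lra, ldec, rra, rdec)
            if dist <= radius_arcsec:
                hits.append((lid, rid, dist))
        hits.sort(key=lambda hit: hit[2])
        matches.extend(hits[:1] if mode == "nearest" else hits)
    return matches

if __name__ == "__main__":
    a = [("a1", 10.0, 0.0), ("a2", 50.0, 20.0)]
    b = [("b1", 10.0, 0.0002), ("b2", 10.0, -0.00025), ("b3", 120.0, 5.0)]
    print(xmatch(a, b, radius_arcsec=1.0, mode="nearest"))
    print(xmatch(a, b, radius_arcsec=1.0, mode="all"))
```

The third element of each tuple plays the role of XMATCH_DISTANCE(b): it is the separation of that particular pair, in arcseconds.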

Writing queries covers the syntax in detail.

The boundary-crossing problem (and margin caches)

There is a subtlety. If a source in catalog A sits near the edge of its partition, its true match in catalog B might live inside a neighbouring partition — which is a different parquet file.

A naïve partition-by-partition crossmatch would miss this match silently. Wrong answers are worse than slow answers.

        catalog A pixel             catalog B's neighbouring pixel
       ┌──────────────────┐         ┌──────────────────┐
       │                  │         │                  │
       │            ★ ────┼── 0.8″ ─┼─── ●             │
       │            A's   │         │   B's match      │
       │            source│         │                  │
       └──────────────────┘         └──────────────────┘
                          partition boundary

The fix that HATS uses is a margin cache: a companion catalog that stores, for each partition, an extra "border" of rows from the neighbouring partitions out to some angular distance (typically a few arcsec to tens of arcsec). When acid runs a crossmatch it reads the partition's data plus its margin, so any match that falls within the margin radius is found regardless of which side of the boundary it sits on.

The margin cache typically ships alongside the catalog as a sibling directory (e.g. gaia_dr3/ plus gaia_dr3_margin/). acid discovers it automatically. If you ever ask for a crossmatch radius larger than the margin cache supports, the query is rejected with an explicit error rather than silently missing boundary matches.

For catalogs that don't come with a margin cache (or whose cache is too narrow for your radius), acid hats build-margin builds one locally.
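The boundary problem and the margin fix are easy to reproduce in a toy model. The sketch below flattens the sky to a 1-D line cut into partitions of width 10, places a source from catalog A just left of a boundary and its counterpart from catalog B just right of it, and matches partition by partition. Everything here (the 1-D geometry, the names, the widths) is invented for illustration and says nothing about how acid or HATS store margins on disk.

```python
PARTITION_WIDTH = 10.0  # toy 1-D "sky": partition k covers [10k, 10k + 10)

def partition_of(x):
    return int(x // PARTITION_WIDTH)

def group_by_partition(positions):
    groups = {}
    for x in positions:
        groups.setdefault(partition_of(x), []).append(x)
    return groups

def build_margin(positions, margin):
    """For each partition, collect rows that live in an adjacent partition
    but lie within `margin` of this partition's edges."""
    margins = {}
    for x in positions:
        home = partition_of(x)
        for neighbour in (home - 1, home + 1):
            lo = neighbour * PARTITION_WIDTH
            hi = lo + PARTITION_WIDTH
            if lo - margin <= x < hi + margin:
                margins.setdefault(neighbour, []).append(x)
    return margins

def crossmatch(a, b, radius, margin=None):
    """Partition-local match of A against B, optionally using a margin on B."""
    if margin is not None and radius > margin:
        # Refuse rather than risk silently missing boundary matches.
        raise ValueError(f"radius {radius} exceeds margin width {margin}")
    b_parts = group_by_partition(b)
    b_margin = build_margin(b, margin) if margin is not None else {}
    pairs = []
    for x in a:
        k = partition_of(x)
        candidates = b_parts.get(k, []) + b_margin.get(k, [])
        pairs += [(x, y) for y in candidates if abs(x - y) <= radius]
    return pairs

if __name__ == "__main__":
    a = [9.6]          # just left of the boundary at 10
    b = [10.4, 15.0]   # true match sits just right of it
    print("without margin:", crossmatch(a, b, radius=1.0))
    print("with margin:   ", crossmatch(a, b, radius=1.0, margin=5.0))
```

Without the margin the pair is lost because the two rows live in different partitions; with it, the row at 10.4 is also visible from partition 0 and the match is found. Asking for a radius wider than the margin raises an error instead of returning an incomplete answer.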

MOC footprints

A MOC (Multi-Order Coverage map) is a compact way to store an arbitrary region of the sky as a set of HEALPix pixels at possibly different orders. Surveys publish MOCs describing their footprints ("all the sky DES observed in the DR2 release"). You can also build MOCs for regions you care about: a science field, a known-artifact mask, the intersection of two surveys.

acid has a built-in predicate, IN_MOC(<alias>, '<name>'), that restricts a query to a named MOC:

SELECT a.source_id, a.ra, a.dec
FROM   gaia AS a
WHERE  IN_MOC(a, 'des_dr2_footprint')

You register MOCs once per session, or once in a YAML config — see Catalogs and the registry. If a catalog ships a point_map.fits footprint (and most modern HATS catalogs do), acid will auto-load it as a MOC named after the catalog, so IN_MOC(a, 'two_mass') works out of the box on the 2MASS HATS catalog.

NOT IN_MOC(...) works too — useful for masking known artifacts or excluding one survey's footprint from another. MOC footprints covers the details.
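Because a MOC is just a set of (order, pixel) cells, testing membership and pruning partitions both reduce to the same ancestor check. The sketch below models a MOC as a Python set of (Norder, Npix) tuples in NESTED numbering. It is a conceptual illustration with made-up names, not acid's implementation or a MOC file reader.

```python
def cells_overlap(cell_a, cell_b):
    """True if two NESTED HEALPix cells (order, npix) share any sky.

    Two cells overlap exactly when the finer one, coarsened to the order of
    the other, lands on the same pixel (one contains the other).
    """
    (order_a, npix_a), (order_b, npix_b) = cell_a, cell_b
    if order_a > order_b:
        (order_a, npix_a), (order_b, npix_b) = (order_b, npix_b), (order_a, npix_a)
    return npix_b >> (2 * (order_b - order_a)) == npix_a

def pixel_in_moc(order, npix, moc):
    """True if the pixel lies entirely inside the MOC, i.e. some MOC cell
    is the pixel itself or one of its ancestors."""
    return any(o <= order and npix >> (2 * (order - o)) == p for o, p in moc)

def prune_partitions(partitions, moc):
    """Keep only partitions that could contain rows inside the MOC."""
    return [part for part in partitions
            if any(cells_overlap(part, cell) for cell in moc)]

if __name__ == "__main__":
    moc = {(1, 5), (2, 40)}                      # a footprint mixing two orders
    partitions = [(0, 1), (0, 3), (2, 21), (3, 164)]
    print(prune_partitions(partitions, moc))     # files worth opening
    print(pixel_in_moc(2, 21, moc))              # order-2 pixel 21 sits inside (1, 5)
```

Pruning only rules files out: a partition that survives may still hold rows outside the footprint, so a per-row test is needed afterwards.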

Putting it together

The mental model in four points:

  1. Catalogs are directories of parquet files, one per HEALPix pixel. ACID reads (and writes) HATS catalogs natively.
  2. Crossmatching is partition-local, with margin caches to handle boundary-crossing matches correctly. This allows ACID to work on many tiles at once, in parallel.
  3. MOC predicates restrict queries to regions of sky, pruning whole partitions when their pixels don't overlap.
  4. You write SQL. The astronomy-specific extensions are XMATCH(radius_arcsec => …) for crossmatching, XMATCH_DISTANCE(<alias>) for the match distance, and IN_MOC(<alias>, '<name>') for footprint filtering.

That's the whole concept stack. The rest of the user guide is just applying it.