Quickstart
In five minutes, you'll download a real HATS catalog, open it, and run a crossmatch — no theory, no config files.
You'll need acid installed; see Installation.
1. Download a small catalog
Grab a cone-shaped subset of the 2MASS point-source catalog from the
LSDB mirror — about 100,000 sources around (RA, Dec) = (50°, -50°):
A bare catalog name is resolved against the LSDB mirror, and — since you
didn't say where to put it — downloaded into ~/datasets/two_mass,
acid's default catalog directory (created for you on first use). That
default is why acid.open(...) below needs no setup: with nothing else
set, both ends look in ~/datasets.
~/datasets/two_mass/ now contains a valid HATS catalog. You can poke
at it the same way — by name:
2. Open a catalog
Just import acid and call acid.open(...). The worker pool spins up
on the first query and is shared across everything you run; you don't
manage it:
import acid
twomass = acid.open("two_mass")
print(twomass.columns) # cheap; reads cached metadata only
print(twomass.describe()) # row count, partitions, footprint, schema
With no configuration, acid searches your catalog path (ACID_PATH,
defaulting to ~/datasets) — the same place the download just landed —
so acid.open("two_mass") finds it by name.
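To make the lookup rule concrete, here is a minimal sketch of how a bare name might be resolved against a search path. It is not acid's implementation: the helper `resolve_catalog` is hypothetical, and treating `ACID_PATH` as an `os.pathsep`-separated list of directories is an assumption. Only the variable name and the `~/datasets` default come from the text above.

```python
import os
from pathlib import Path


def resolve_catalog(name, env=None):
    """Return the first directory called `name` found on the search path.

    Hypothetical helper for illustration; acid's real lookup may differ.
    """
    env = os.environ if env is None else env
    # Assumption: ACID_PATH may list several roots separated by os.pathsep.
    raw = env.get("ACID_PATH") or str(Path.home() / "datasets")
    roots = [Path(p).expanduser() for p in raw.split(os.pathsep) if p]
    for root in roots:
        candidate = root / name
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(f"catalog {name!r} not found under {roots}")
```

Under these assumptions, a download that lands in `~/datasets/two_mass` and a lookup of `"two_mass"` agree without any configuration, because both fall back to the same default root.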
acid.open("two_mass") returns a lazy Catalog handle. No data is read
yet — composition is metadata-only.
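If "lazy" is unfamiliar, the toy class below shows the general idea: verbs only record a plan, and a terminal method is what finally touches the data. `LazyQuery` and its `collect` method are hypothetical names invented for this sketch, and it takes Python callables where acid takes SQL strings. It says nothing about how acid is built internally.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class LazyQuery:
    """Toy stand-in for a lazy handle: it records steps, it does not run them."""
    rows: tuple                       # pretend this is data sitting on disk
    steps: tuple = field(default=())  # the recorded plan

    def where(self, predicate):
        return replace(self, steps=self.steps + (("where", predicate),))

    def select(self, *columns):
        return replace(self, steps=self.steps + (("select", columns),))

    def collect(self):
        """Terminal method: only here is `rows` actually read."""
        out = list(self.rows)
        for kind, arg in self.steps:
            if kind == "where":
                out = [r for r in out if arg(r)]
            else:  # "select"
                out = [{c: r[c] for c in arg} for r in out]
        return out


rows = ({"ra": 50.1, "j_m": 13.2}, {"ra": 50.3, "j_m": 15.8})
bright = LazyQuery(rows).where(lambda r: r["j_m"] < 14.0).select("ra")
print(len(bright.steps))  # two recorded steps, nothing executed yet
print(bright.collect())   # execution happens here
```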
3. Filter, project, materialize
Compose with verbs, then trigger execution with a terminal method
(.head, .to_astropy, .to_polars, ...):
bright = (acid.open("two_mass")
          .where("j_m < 14.0")
          .select("designation, ra, decl, j_m")
          .limit(10))
bright.head(10).show() # pretty-print to stdout
tbl = bright.to_astropy() # also: .to_polars(), .to_arrow(), .to_pandas()
The query is plain SQL inside .where(...) and .select(...).
acid reads only the columns you asked for from disk, even though the
catalog has 60 more.
Result.show(n) uses the same fixed-width renderer the CLI does, so the
output matches acid query "...". print(result) instead renders the
result as a Polars DataFrame — with its shape: header and Polars's
own row truncation.
4. Crossmatch two catalogs
Download a second catalog over the same region:
On the mirror this catalog's collection is named gaia_dr3; the second
argument stores it locally as plain gaia (a bare name → ~/datasets/gaia),
so we can refer to it as gaia from here on. (two_mass above needed only
one argument — its collection name was already the name we wanted.)
Not sure what's out there? acid search lists catalogs you can download from
the online mirrors, and acid list shows the ones already on your machine.
Now ask: "for every Gaia source, find any 2MASS source within 1 arcsec":
import acid
import astropy.units as u
gaia = acid.open("gaia")
twomass = acid.open("two_mass")
matches = (gaia.crossmatch(twomass, radius=1*u.arcsec)
           .select("source_id, designation"))
matches.head(20).show()
Three things to note:
- radius=1*u.arcsec is the only non-standard bit. Quantities are required: bare floats are rejected so units never get guessed wrong.
- By default you get the single closest match per anchor row (maxmatch=1); pass maxmatch=-1 for every match within the radius, and how="left" to keep anchors that have no match.
- You don't normally set the worker count; acid sizes the pool to your machine automatically. Reach for workers=N only to fix a problem, e.g. lower it if a query runs out of memory (fewer workers, more headroom each).
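The matching rules described above can be illustrated with a brute-force sketch in plain Python. This is not how acid executes a crossmatch (it compares every pair, which only works for tiny inputs). The function names and the `(id, ra_deg, dec_deg)` row layout are assumptions made for the example, and the radius is a bare float in arcseconds purely to keep the sketch dependency-free.

```python
import math


def sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine; inputs in degrees)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0


def crossmatch(anchors, others, radius_arcsec, maxmatch=1, how="inner"):
    """Brute-force O(N*M) match; each input row is (id, ra_deg, dec_deg)."""
    out = []
    for a_id, a_ra, a_dec in anchors:
        hits = sorted(
            (sep_arcsec(a_ra, a_dec, o_ra, o_dec), o_id)
            for o_id, o_ra, o_dec in others
        )
        hits = [(d, o_id) for d, o_id in hits if d <= radius_arcsec]
        if maxmatch != -1:
            hits = hits[:maxmatch]  # keep only the closest `maxmatch` hits
        if hits:
            out.extend((a_id, o_id, d) for d, o_id in hits)
        elif how == "left":
            out.append((a_id, None, None))  # unmatched anchor is kept
    return out


gaia = [("g1", 50.0, -50.0), ("g2", 51.0, -50.0)]
twomass = [("t1", 50.0, -50.0 + 0.2 / 3600), ("t2", 50.0, -50.0 - 0.5 / 3600)]

closest = crossmatch(gaia, twomass, 1.0)             # g1 paired with t1 only
every = crossmatch(gaia, twomass, 1.0, maxmatch=-1)  # g1 paired with t1 and t2
kept = crossmatch(gaia, twomass, 1.0, how="left")    # g2 kept with None fields
```

Here `g2` has no neighbour within 1 arcsec, so it disappears from the default result and only survives when `how="left"` is requested.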
5. Restrict to a region of sky
If you only care about a small part of the sky — to debug a query
before running it full-sky — run it inside an acid.in_cone(...)
block. The cone is applied when a query executes inside the block,
to every query (fluent and SQL) materialized there. The same Catalog
object runs scoped inside the block and full-sky outside it:
import acid
import astropy.units as u
gaia = acid.open("gaia")
twomass = acid.open("two_mass")
matches = (gaia.crossmatch(twomass, radius=1*u.arcsec)
           .select("source_id, designation"))
with acid.in_cone((50.0, -50.0), radius=0.5*u.deg):
    # Iterate cheaply: only partitions overlapping the cone are read.
    small_tbl = matches.to_astropy()
# Outside the block, the *same* query runs full-sky:
big_tbl = matches.to_astropy()
acid enumerates only partitions overlapping the cone (skipping the
rest), then enforces dist ≤ radius exactly via a great-circle
predicate — no boundary artefacts. Cones do not nest; one block at a
time (a nested in_cone raises ValidationError).
6. Save or export the result
A pipeline ends in one of two terminal verbs — save for a result that
stays queryable, export for one that leaves as a single file.
save — a HATS catalog you can reuse and hand off to LSDB. A bare name
joins your catalog library (it lands under your ACID_PATH root), so later
sessions re-open it by name:
import acid
import astropy.units as u
gaia = acid.open("gaia")
twomass = acid.open("two_mass")
saved = (gaia
         .crossmatch(twomass, radius=1*u.arcsec)
         .select("source_id, designation")
         .save("gxt"))  # → <ACID_PATH>/gxt, registered as "gxt"
# `saved` is a normal Catalog handle bound to the freshly written tree.
# In this *and* future sessions, the name resolves in both acid.sql and acid.open:
print(acid.sql.query("SELECT COUNT(*) AS n FROM gxt"))
The written tree is a standards-compliant HATS catalog — any HATS reader
(including LSDB) opens it directly. save streams partition by partition, so
it scales to full-sky outputs. (Pass an explicit path like
./out/gaia_x_2mass to write somewhere specific instead of the library.)
export — one flat file for another tool. For a target list or paper
table, export writes a single CSV / parquet / FITS file (format by
extension or format=) and returns its path:
path = (gaia
        .crossmatch(twomass, radius=1*u.arcsec)
        .select("source_id, designation")
        .export("crossmatch.csv"))
export gathers the whole result in memory before writing — perfect for
selective queries, but use save (streaming) for anything full-sky.
See Working with results & exporting for the full output-format menu.
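The memory difference between the two verbs comes down to streaming versus gathering. The sketch below shows that contrast with the standard library only. The function names are hypothetical, and CSV files stand in for the partitioned Parquet tree a real HATS catalog uses, so treat it as a picture of the trade-off rather than of acid's writer.

```python
import csv
import tempfile
from pathlib import Path


def partitions():
    """Pretend result: yields one partition (a list of rows) at a time."""
    yield [("g1", "t1"), ("g2", "t2")]
    yield [("g3", "t3")]


def save_streaming(parts, out_dir):
    """Write each partition as it arrives; only one is in memory at a time."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i, rows in enumerate(parts):
        path = out_dir / f"part_{i}.csv"
        with path.open("w", newline="") as fh:
            csv.writer(fh).writerows(rows)
        written.append(path)
    return written


def export_single(parts, path):
    """Gather every row first, then write one flat file."""
    all_rows = [row for rows in parts for row in rows]  # whole result in memory
    path = Path(path)
    with path.open("w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["source_id", "designation"])
        writer.writerows(all_rows)
    return path


with tempfile.TemporaryDirectory() as tmp:
    files = save_streaming(partitions(), Path(tmp) / "gxt")
    flat = export_single(partitions(), Path(tmp) / "crossmatch.csv")
    print(len(files), "partition files;", flat.name)
```

Peak memory for the streaming writer is bounded by the largest partition, while the single-file writer grows with the whole result.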
7. Drop into SQL when you need to
The fluent verbs cover crossmatches, joins, filters, projections,
and aggregations (group_by / aggregate — see the
aggregation guide). But some queries just read
better as SQL, and a few shapes are SQL-only — when that's the case,
hand the same connection a plain SQL string with acid.sql.query(...):
r = acid.sql.query("""
    SELECT g.source_id,
           COUNT(*) AS n,
           AVG(d)   AS avg_d
    FROM gaia AS g
    JOIN two_mass AS t ON XMATCH(radius_arcsec => 1.0, mode => 'all', dist_col => 'd')
    GROUP BY g.source_id
    HAVING COUNT(*) >= 2
    ORDER BY avg_d ASC
    LIMIT 100
""")
print(r)
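To see what the GROUP BY / HAVING part of that query does, the same aggregation can be run with Python's built-in sqlite3 over a small hand-made table of already-matched pairs. The `pairs` table and its values are invented for illustration, and the sketch assumes that the XMATCH join with mode 'all' yields one row per pair within the radius, with the separation in column `d`. SQLite has no XMATCH, so the matching step itself is not reproduced.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pairs (source_id INTEGER, designation TEXT, d REAL)")
con.executemany(
    "INSERT INTO pairs VALUES (?, ?, ?)",
    [
        (1, "a", 0.25), (1, "b", 0.75),  # two matches, mean distance 0.5
        (2, "c", 0.125),                 # one match: removed by HAVING
        (3, "d", 0.25), (3, "e", 0.25),  # two matches, mean distance 0.25
    ],
)

rows = con.execute("""
    SELECT source_id, COUNT(*) AS n, AVG(d) AS avg_d
    FROM pairs
    GROUP BY source_id
    HAVING COUNT(*) >= 2
    ORDER BY avg_d ASC
    LIMIT 100
""").fetchall()
print(rows)  # [(3, 2, 0.25), (1, 2, 0.5)]
```

Sources with a single counterpart drop out, and the survivors are ordered by how tightly their counterparts cluster.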
Where to next?
You've already seen the core API: acid.open(...), the verbs,
acid.in_cone(...), acid.sql.query(...). Three good directions:
- Your first crossmatch — a notebook that turns the example above into a small science story with plots.
- Cookbook — short self-contained recipes for the patterns you'll hit on real data (footprint filtering, self-crossmatch, top-K, materialize-and-reuse, ...).
- Connections and Writing queries: the lifecycle of a Connection, when to use acid.sql.query(...) vs the fluent verbs, and the full SQL dialect ACID supports.
Picking an output type
.to_astropy() returns an astropy Table — the natural fit for
catalog work (units, coordinates, FITS round-trips). For anything
heavy — group-by, filtering, joins on multi-million-row results —
.to_polars() is typically 5–50× faster than pandas and just as
easy to read. .to_pandas() exists if a downstream library needs it,
but reach for .to_astropy() or .to_polars() first.
And to hear when a new version lands, subscribe to release announcements — low traffic, one email per release.