Catalogs and the registry¶
A Connection knows about a set of catalogs — HATS-shaped
directories on local disk — and a set of MOCs — sky-region
predicates you can use in WHERE IN_MOC(...) or Catalog.in_region(...).
Together they form the registry. This page covers the three ways to
populate it, the auto-detection rules, and the per-catalog fields you
can override.
The three sources¶
You pick one (or pass several at once) when you call acid.init(...):
- A single directory, such as /data/hats. Every immediate subdirectory of /data/hats that contains a properties or hats.properties file is treated as a HATS catalog and is opened by name with acid.open("subdir_name"). Margin caches (whose properties declare dataproduct_type = margin) are skipped as user-visible catalogs; they're attached to their parent catalog instead.
- A YAML config. It gives you explicit names, optional overrides, and a place to register MOCs in the same file. The full grammar is in The YAML registry below.
- Multiple directories, searched in order for acid.open(name). Leftmost match wins.
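The directory-scan rule can be illustrated with a small standard-library sketch. It is not acid's implementation: read_properties and scan_root are hypothetical helpers, the key = value parsing of the properties file is an assumption about its format, and the "object" value written in the demo is only a placeholder for a non-margin catalog. The sketch mirrors only what is stated above: an immediate subdirectory counts as a catalog if it holds a properties or hats.properties file, unless that file declares dataproduct_type = margin.

```python
import tempfile
from pathlib import Path


def read_properties(path: Path) -> dict:
    """Parse key = value lines (assumed format); skip blanks and # comments."""
    props = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def scan_root(root: Path) -> dict:
    """Map catalog name -> directory for immediate subdirectories that look like catalogs."""
    found = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        for fname in ("properties", "hats.properties"):
            candidate = sub / fname
            if candidate.is_file():
                # Margin caches belong to a parent catalog and are not listed.
                if read_properties(candidate).get("dataproduct_type") != "margin":
                    found[sub.name] = sub
                break
    return found


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    layout = {
        "gaia_dr3": "properties",
        "gaia_dr3_margin": "properties",
        "twomass_psc": "hats.properties",
        "notes": None,  # no properties file, so not a catalog
    }
    for name, fname in layout.items():
        (root / name).mkdir()
        if fname:
            kind = "margin" if name.endswith("_margin") else "object"
            (root / name / fname).write_text(f"dataproduct_type = {kind}\n")
    catalog_names = sorted(scan_root(root))

print(catalog_names)
```

Only gaia_dr3 and twomass_psc survive the scan: the margin directory is filtered by its declared type, and the plain notes directory has no properties file at all.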
acid.init(...) is optional — the first acid.open() / acid.sql.query()
lazy-inits a default connection — but call it when you want to pin the
source (or workers, threads, …) up front. You can also start empty
and add catalogs later:
acid.init([])
acid.register_catalog("gaia_dr3", path="/data/gaia_dr3")
acid.register_catalog("targets", path="/scratch/my_targets")
Raw files and in-memory frames (virtual catalogs)¶
acid.open(...) also accepts a raw data file or an in-memory
frame directly — a virtual catalog. This is the on-ramp for a
target list that isn't a HATS catalog, with no offline import:
import acid
acid.init("/data/hats")
# A file on disk: .parquet / .csv / .tsv / .fits / .arrow / .feather / VOTable
targets = acid.open("targets.csv", ra="RA", dec="DEC")
# Or an in-memory frame: NumPy structured array / pandas / polars / pyarrow / Astropy
import pandas as pd
targets = acid.open(pd.read_csv("targets.csv"), ra="RA", dec="DEC")
The source is spilled once at open() to a single memory-mapped Arrow
file under the connection's scratch directory (removed on close()),
adaptively partitioned by sky density, and from then on behaves like an
ordinary (if coarse) catalog — usable as a crossmatch root or operand.
A few rules:
- ra= / dec= are required and never guessed (degrees, ICRS); NULL/NaN-coordinate rows are dropped with a warning.
- acid.open(<file>) does not register a name, so the SQL escape hatch can't see it. To use a raw file by name in acid.sql.query(...), register it with acid.register_file(name, path, ra=…, dec=…) (or, on the CLI, acid query --open).
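To make the first rule concrete, here is a minimal standard-library sketch of the kind of filtering it describes. It does not use acid; clean_targets is a hypothetical helper, and treating empty cells as NULL and rejecting infinities along with NaN are assumptions of this sketch. The coordinate column names are passed explicitly, as the rule requires.

```python
import csv
import io
import math
import warnings


def clean_targets(text: str, ra: str, dec: str) -> list:
    """Keep rows whose ra/dec cells parse as finite floats; warn about the rest."""
    kept, dropped = [], 0
    for row in csv.DictReader(io.StringIO(text)):
        try:
            ra_deg, dec_deg = float(row[ra]), float(row[dec])
        except (TypeError, ValueError):
            dropped += 1  # empty or non-numeric cell, treated as NULL here
            continue
        if not (math.isfinite(ra_deg) and math.isfinite(dec_deg)):
            dropped += 1  # NaN (and, in this sketch, infinite) coordinates
            continue
        kept.append({**row, ra: ra_deg, dec: dec_deg})
    if dropped:
        warnings.warn(f"dropped {dropped} row(s) with NULL/NaN coordinates")
    return kept


SAMPLE = """id,RA,DEC
1,10.5,-3.2
2,,4.0
3,nan,1.0
4,200.0,45.0
"""

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    rows = clean_targets(SAMPLE, ra="RA", dec="DEC")

kept_ids = [r["id"] for r in rows]
print(kept_ids, len(caught))
```

Rows 2 and 3 are dropped and a single warning summarizes both.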
See bring your own target list for the crossmatch workflow and the full set of accepted types.
Auto-detected catalog metadata¶
For each registered catalog, acid reads:
- properties (or hats.properties): for hats_col_ra, hats_col_dec, hats_col_healpix, hats_col_healpix_order, hats_margin_threshold, etc.
- partition_info.csv: for the actual (Norder, Npix) partition list and the maximum Norder (the catalog's hpix_order).
- A margin_cache/ subdirectory or *_margin* sibling: picked up automatically as the catalog's margin cache. The neighbor_margin_arcsec comes from the margin's own properties (hats_margin_threshold).
- point_map.fits: a per-cell row-count map. ACID requires one for every HATS catalog in a query: it's how the planner sizes work tuples to your RAM budget (see Performance: RAM budget). Every acid output and every catalog from a current hats-import carries one; a missing or 0/1-mask map is a clear ValidationError naming the catalog. The same file doubles as the sky-density footprint auto-loaded when a query references the catalog by name in IN_MOC(<alias>, '<catalog_name>') or Catalog.in_region(<catalog_name>).
You rarely need to override any of these. The most common reason to do so is a non-standard catalog layout (a hive directory, a catalog without canonical HATS properties, or a margin cache in an unusual location).
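As a rough illustration of what auto-detection involves, the sketch below derives a few of these fields from files on disk using only the standard library. It is not acid's code: parse_properties and detect_metadata are hypothetical names, the key = value properties format and the Norder/Npix CSV header are assumptions, and only the margin_cache/ subdirectory case is handled (not the *_margin* sibling).

```python
import csv
import tempfile
from pathlib import Path


def parse_properties(text: str) -> dict:
    """Parse key = value lines (assumed format) into a dict."""
    pairs = (line.partition("=") for line in text.splitlines())
    return {key.strip(): value.strip() for key, sep, value in pairs if sep}


def detect_metadata(catalog_dir: Path) -> dict:
    """Collect a few auto-detectable fields for one HATS-shaped directory."""
    candidates = (catalog_dir / "properties", catalog_dir / "hats.properties")
    props_file = next(p for p in candidates if p.is_file())
    props = parse_properties(props_file.read_text())

    # partition_info.csv lists one (Norder, Npix) pair per partition.
    with open(catalog_dir / "partition_info.csv", newline="") as fh:
        partitions = [(int(r["Norder"]), int(r["Npix"])) for r in csv.DictReader(fh)]

    meta = {
        "ra_col": props.get("hats_col_ra"),
        "dec_col": props.get("hats_col_dec"),
        "hpix_order": max(order for order, _ in partitions),
        "neighbor_path": None,
        "neighbor_margin_arcsec": None,
    }

    # The margin width is read from the margin cache's own properties file.
    margin_props = catalog_dir / "margin_cache" / "properties"
    if margin_props.is_file():
        margin = parse_properties(margin_props.read_text())
        meta["neighbor_path"] = str(margin_props.parent)
        meta["neighbor_margin_arcsec"] = float(margin["hats_margin_threshold"])
    return meta


# Demo on a throwaway catalog directory with made-up values.
with tempfile.TemporaryDirectory() as tmp:
    cat = Path(tmp, "demo_catalog")
    (cat / "margin_cache").mkdir(parents=True)
    (cat / "properties").write_text("hats_col_ra = ra\nhats_col_dec = decl\n")
    (cat / "partition_info.csv").write_text("Norder,Npix\n2,17\n5,1234\n3,40\n")
    (cat / "margin_cache" / "properties").write_text(
        "dataproduct_type = margin\nhats_margin_threshold = 10.0\n"
    )
    meta = detect_metadata(cat)

print(meta["ra_col"], meta["dec_col"], meta["hpix_order"], meta["neighbor_margin_arcsec"])
```

Any field set explicitly in a registry entry would simply take precedence over the value derived this way.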
The YAML registry¶
catalogs:
  gaia_dr3:
    path: /data/hats/gaia_dr3
  twomass_psc:
    path: /data/hats/twomass_psc
    # Per-catalog overrides are optional. Any field you don't set
    # here is auto-detected from properties / partition_info.csv.
    # ra_col: ra
    # dec_col: decl  # 2MASS names its declination column "decl"
    # hpix_order: 5
    # neighbor_path: /scratch/twomass_margin_10arcsec
    # neighbor_margin_arcsec: 10.0
  my_targets:
    path: /scratch/my_targets
    # Hive-partitioned (not HATS): tell acid so it doesn't try to
    # read HATS metadata files.
    layout: hive
    ra_col: ra
    dec_col: dec
    hpix_order: 5
    partition_col: hpix
mocs:
  des_dr2_footprint: /data/footprints/des_dr2.fits
  delve_dr3: /data/footprints/delve_dr3.fits
The top-level keys:
| Key | Required | What it means |
|---|---|---|
| catalogs: | yes | Map of <name>: {path: ..., ...}. The name is what you pass to acid.open(...). |
| mocs: | no | Map of <name>: <path to FITS MOC>. Registered alongside catalogs, available to IN_MOC(<alias>, '<name>') and .in_region(<name>). |
The per-catalog fields:
| Field | Default | What it means |
|---|---|---|
| path | required | Local path to the catalog directory. Can be a HATS catalog root or a CatalogCollection root (in which case the primary table is resolved from collection.properties). |
| layout | auto (hats if HATS metadata present, else hive) | hats or hive. Set explicitly only for hive-layout parquet directories. |
| ra_col | auto (hats_col_ra property) | Name of the RA column. |
| dec_col | auto (hats_col_dec property) | Name of the Dec column. |
| hpix_order | auto (max Norder from partition_info.csv) | The catalog's HEALPix order. Required for hive catalogs (no HATS metadata to derive it from). |
| partition_col | Npix (HATS) / hpix (hive) | The column / directory level holding the HEALPix index. |
| neighbor_path | auto (sibling *_margin* or margin_cache/ subdir) | Path to the margin cache for this catalog. See Margin caches. |
| neighbor_margin_arcsec | auto (from the margin cache's hats_margin_threshold) | The cache's recorded width in arcsec. This is what the analyzer compares against radius_arcsec in XMATCH(...). |
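The field rules in the table can be restated as a small validation pass over a registry that has already been loaded into plain Python dicts (for example by a YAML parser, which is not shown). This is a sketch under assumptions rather than acid's loader: normalize_entry and HIVE_REQUIRED are hypothetical names, and the choice to require ra_col, dec_col and hpix_order for hive entries while defaulting partition_col to hpix follows one reading of the table.

```python
# Fields with no HATS metadata to fall back on when layout is "hive".
HIVE_REQUIRED = ("ra_col", "dec_col", "hpix_order")


def normalize_entry(name: str, entry: dict) -> dict:
    """Check one catalog entry and fill the documented hive default."""
    if "path" not in entry:
        raise ValueError(f"catalog {name!r}: 'path' is required")
    layout = entry.get("layout")
    if layout not in (None, "hats", "hive"):
        raise ValueError(f"catalog {name!r}: layout must be 'hats' or 'hive', got {layout!r}")
    out = dict(entry)
    if layout == "hive":
        missing = [field for field in HIVE_REQUIRED if field not in out]
        if missing:
            raise ValueError(f"catalog {name!r}: hive layout needs {', '.join(missing)}")
        out.setdefault("partition_col", "hpix")
    return out


# The same shape as the YAML example, written as a dict.
registry = {
    "catalogs": {
        "gaia_dr3": {"path": "/data/hats/gaia_dr3"},
        "my_targets": {
            "path": "/scratch/my_targets",
            "layout": "hive",
            "ra_col": "ra",
            "dec_col": "dec",
            "hpix_order": 5,
        },
    }
}
normalized = {n: normalize_entry(n, e) for n, e in registry["catalogs"].items()}

# A hive entry without its required fields is rejected.
try:
    normalize_entry("broken_hive", {"path": "/scratch/x", "layout": "hive"})
    error = None
except ValueError as exc:
    error = str(exc)

print(normalized["my_targets"]["partition_col"], error)
```

Fields left unset on a HATS entry stay unset here, since auto-detection from the catalog's files would fill them later.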
Adding catalogs at runtime¶
acid.register_catalog(name, **kwargs) accepts the same per-catalog
fields as the YAML:
acid.init("/data/hats")
# A catalog not in /data/hats:
acid.register_catalog("offsite", path="/mnt/backups/another_catalog")
# Override the auto-detected margin path for an existing entry:
acid.register_catalog(
    "twomass_psc",
    path="/data/hats/twomass_psc",
    neighbor_path="/scratch/twomass_margin_10arcsec",
)
Re-registering an existing name silently replaces the old entry on the
connection — no overwrite= flag needed. The new entry takes effect
on the next acid.open(name).
acid.register_moc(name, source) registers a MOC at runtime; source
is a FITS path, an in-memory mocpy.MOC, or an (N, 2) array of
order-29 [lo, hi) integer ranges (the same shape acid uses
internally).
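For readers unfamiliar with the range form, the sketch below shows how a HEALPix cell maps to a half-open order-29 interval and how a list of cells collapses into sorted, non-overlapping [lo, hi) pairs. It assumes the NESTED numbering scheme used by the MOC standard; cell_to_range and merge_ranges are hypothetical helpers written in pure Python, not part of acid or mocpy.

```python
MAX_ORDER = 29


def cell_to_range(order: int, ipix: int) -> tuple:
    """Return the [lo, hi) order-29 range covered by NESTED cell (order, ipix)."""
    if not 0 <= order <= MAX_ORDER:
        raise ValueError(f"order must be in [0, {MAX_ORDER}], got {order}")
    if not 0 <= ipix < 12 * 4**order:
        raise ValueError(f"ipix {ipix} is out of range for order {order}")
    # Each step down in order splits a cell into 4 children: 2 bits per level.
    shift = 2 * (MAX_ORDER - order)
    return ipix << shift, (ipix + 1) << shift


def merge_ranges(ranges) -> list:
    """Sort ranges and merge any that overlap or touch."""
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return merged


# The four order-1 children of order-0 cell 0 cover exactly that cell.
children = [cell_to_range(1, ipix) for ipix in range(4)]
parent = cell_to_range(0, 0)
merged = merge_ranges(children)
print(parent, merged)
```

The merged list of [lo, hi] pairs has the (N, 2) shape described above; wrapping it in a NumPy integer array would be a natural next step before handing it to a registration call.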
Listing what's available¶
acid.init("/data/hats")
for cat in acid.list_catalogs():
    print(cat.name, cat.margins_arcsec)  # gaia_dr3 [10.0, 300.0], ...
list_catalogs() is the Python equivalent of the acid list
CLI: it crawls the connection's roots — local directories, ssh:// hosts,
and http(s):// mirrors alike — and returns one CatalogInfo row
(name, margins_arcsec, root, shadowed) per catalog. A namespaced catalog
surfaces as namespace/child, margin-cache siblings are attributed to
their parent (not listed as catalogs), and a name found at more than one
root is flagged shadowed on the later one (acid.open resolves
first-wins). Catalogs registered explicitly (by YAML or register_catalog)
are included too.
It is opt-in discovery: the crawl is O(roots × subdirs) and can be slow
on remote roots. It does not read full per-catalog metadata — that
happens lazily on acid.open(name), which is the point at which
properties, partition_info.csv, and the margin sibling are inspected.
On a remote root or a slow filesystem, prefer to get the catalog
names from a YAML file or your survey docs and skip the walk;
acid.open("gaia_dr3") works without list_catalogs() having been
called.
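The first-wins and shadowed behaviour can be pictured with a toy model that involves no filesystem and no acid calls. CatalogRow and list_rows are hypothetical stand-ins that borrow only the name, root and shadowed fields from the description above; the input is an ordered mapping of root to the catalog names found under it.

```python
from dataclasses import dataclass


@dataclass
class CatalogRow:
    name: str
    root: str
    shadowed: bool


def list_rows(roots: dict) -> list:
    """One row per (root, name); a name already seen at an earlier root is shadowed."""
    seen = set()
    rows = []
    for root, names in roots.items():  # dict order stands in for search order
        for name in names:
            rows.append(CatalogRow(name, root, shadowed=name in seen))
        seen.update(names)
    return rows


def resolve(name: str, rows: list) -> str:
    """Return the root that wins for `name`: the first non-shadowed row."""
    return next(r.root for r in rows if r.name == name and not r.shadowed)


rows = list_rows(
    {
        "/data/hats": ["gaia_dr3", "twomass_psc"],
        "/scratch": ["gaia_dr3", "targets"],
    }
)
shadowed = [(r.name, r.root) for r in rows if r.shadowed]
print(shadowed, resolve("gaia_dr3", rows))
```

The copy of gaia_dr3 under /scratch is listed but flagged, while resolution still lands on /data/hats.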
Hive-layout catalogs¶
acid also reads catalogs that follow a hive-partitioned layout
(directories like hpix=42/foo.parquet) but do not include the full
HATS metadata. The price is that you must declare the partition
column, the RA / Dec columns, and the HEALPix order in YAML:
catalogs:
  my_hive:
    path: /scratch/hive_export
    layout: hive
    partition_col: hpix
    ra_col: ra
    dec_col: dec
    hpix_order: 5
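Before declaring a hive export in YAML, it can help to confirm that the directory names and the declared order agree. The following is a standard-library sketch, not an acid feature: hive_partitions and out_of_range are hypothetical helpers, and the only facts relied on are the hpix=<index>/ directory convention described above and the HEALPix cell count of 12 × 4^order.

```python
import re
import tempfile
from pathlib import Path


def hive_partitions(root: Path, partition_col: str = "hpix") -> dict:
    """Map HEALPix index -> parquet file names for <partition_col>=<index>/ directories."""
    pattern = re.compile(rf"{re.escape(partition_col)}=(\d+)")
    parts = {}
    for sub in root.iterdir():
        match = pattern.fullmatch(sub.name)
        if sub.is_dir() and match:
            parts[int(match.group(1))] = sorted(p.name for p in sub.glob("*.parquet"))
    return parts


def out_of_range(indices, hpix_order: int) -> list:
    """Return indices that are not valid HEALPix cells at the declared order."""
    npix = 12 * 4**hpix_order  # number of HEALPix cells at this order
    return sorted(i for i in indices if not 0 <= i < npix)


# Demo on a throwaway export with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ("hpix=42/foo.parquet", "hpix=7/a.parquet", "hpix=7/b.parquet", "_tmp/x.parquet"):
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    partitions = hive_partitions(root)

bad_at_order_5 = out_of_range(partitions, hpix_order=5)
bad_at_order_0 = out_of_range(partitions, hpix_order=0)
print(sorted(partitions), bad_at_order_5, bad_at_order_0)
```

An index that is out of range for the declared hpix_order, as 42 is at order 0, usually means the order in the YAML does not match the one used when the export was written.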
Hive catalogs do not ship a margin cache or point_map.fits, which
limits what they can do:
- They cannot be the right side of an XMATCH (no margin cache, so the analyzer rejects with the "no neighbor_path configured" error).
- Without a point_map.fits, they also can't be sized by the RAM-budget planner (which needs the row-count map). For an ad-hoc hive export, the simplest path is to open it as a virtual catalog instead: acid.open("/scratch/hive_export/...", ra=…, dec=…) reads a file directly and partitions it for you.
Build a HATS copy of your data (e.g. with hats-import) if you need a
persistent catalog you can crossmatch into repeatedly.
See also¶
- Connections: the Connection lifecycle and acid.open semantics.
- Downloading catalogs: fetching a HATS catalog (with its margin cache) from data.lsdb.io over HTTP or SSH.
- Margin caches: the neighbor_path and neighbor_margin_arcsec fields above, in depth.
- Sky regions & footprints: the mocs: section above, in depth.
- Errors (RegistryError): what the analyzer says when a catalog can't be found, a margin path is missing, or a duplicate-name registration is rejected.