
Catalogs and the registry

A Connection knows about a set of catalogs — HATS-shaped directories on local disk — and a set of MOCs — sky-region predicates you can use in WHERE IN_MOC(...) or Catalog.in_region(...). Together they form the registry. This page covers the three ways to populate it, the auto-detection rules, and the per-catalog fields you can override.

The three sources

You pick one (or pass several at once) when you call acid.init(...):

acid.init("/data/hats")

Every immediate subdirectory of /data/hats that contains a properties or hats.properties file is treated as a HATS catalog and can be opened by name with acid.open("subdir_name"). Margin caches (whose properties declare dataproduct_type = margin) are not exposed as user-visible catalogs; they are attached to their parent catalog instead.
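The discovery rule above can be sketched in plain Python. This is only an illustration of the rule as described, not acid's implementation: read_properties and discover_catalogs are hypothetical helpers, and the simple key = value parsing is an assumption about the properties file format.

```python
from pathlib import Path


def read_properties(path: Path) -> dict[str, str]:
    """Parse simple `key = value` lines, skipping blanks and # comments."""
    props = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def discover_catalogs(root: str) -> dict[str, Path]:
    """Map catalog name -> directory for each immediate HATS subdirectory."""
    found = {}
    for sub in sorted(Path(root).iterdir()):
        if not sub.is_dir():
            continue
        for fname in ("properties", "hats.properties"):
            props_file = sub / fname
            if props_file.is_file():
                props = read_properties(props_file)
                # Margin caches belong to a parent catalog, so they are
                # not listed under their own name.
                if props.get("dataproduct_type") != "margin":
                    found[sub.name] = sub
                break
    return found
```

Only immediate subdirectories are inspected; nested directories and stray files are ignored, which matches the "immediate subdirectory" wording above.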

acid.init("/path/to/catalogs.yaml")

A YAML config gives you explicit names, optional overrides, and a place to register MOCs in the same file. The full grammar is in The YAML registry below.

acid.init(["/data/hats", "/scratch/local"])

Pass a list to register multiple directories. They are searched in order for acid.open(name), and the leftmost match wins.
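As a rough sketch of what "leftmost match wins" means, the lookup can be modelled as a first-hit scan over the roots. The resolve helper below is hypothetical and only checks that a directory exists; the real lookup presumably also checks for HATS metadata.

```python
from pathlib import Path
from typing import Optional


def resolve(name: str, roots: list[str]) -> Optional[Path]:
    """Return the first <root>/<name> directory found, scanning left to right."""
    for root in roots:
        candidate = Path(root) / name
        if candidate.is_dir():
            return candidate
    return None
```

With roots ["/data/hats", "/scratch/local"], a gaia_dr3 directory present under both resolves to the /data/hats copy, and the /scratch/local copy is never reached.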

acid.init({"gaia_dr3": {"path": "/data/gaia_dr3"}})

Same shape as the YAML's catalogs: block. Useful for tests and one-off scripts that build the registry programmatically.
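For a test or a one-off script, the dict can be assembled from a list of catalog directories. The sketch below uses a hypothetical registry_from_paths helper and assumes the directory name is an acceptable catalog name. Raising on a duplicate name is a choice made for this example, not documented acid behaviour.

```python
from pathlib import Path


def registry_from_paths(paths: list[str]) -> dict[str, dict[str, str]]:
    """Build {"<dir name>": {"path": "<dir>"}} from catalog directories."""
    registry = {}
    for p in paths:
        name = Path(p).name
        if name in registry:
            # Two directories with the same basename would collide silently.
            raise ValueError(f"duplicate catalog name: {name!r}")
        registry[name] = {"path": str(p)}
    return registry


catalogs = registry_from_paths(["/data/gaia_dr3", "/scratch/my_targets"])
# `catalogs` has the shape shown above and could be passed to acid.init(catalogs).
```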

acid.init(...) is optional — the first acid.open() / acid.sql.query() lazy-inits a default connection — but call it when you want to pin the source (or workers, threads, …) up front. You can also start empty and add catalogs later:

acid.init([])
acid.register_catalog("gaia_dr3", path="/data/gaia_dr3")
acid.register_catalog("targets", path="/scratch/my_targets")

Raw files and in-memory frames (virtual catalogs)

acid.open(...) also accepts a raw data file or an in-memory frame directly — a virtual catalog. This is the on-ramp for a target list that isn't a HATS catalog, with no offline import:

import acid

acid.init("/data/hats")

# A file on disk: .parquet / .csv / .tsv / .fits / .arrow / .feather / VOTable
targets = acid.open("targets.csv", ra="RA", dec="DEC")

# Or an in-memory frame: NumPy structured array / pandas / polars / pyarrow / Astropy
import pandas as pd
targets = acid.open(pd.read_csv("targets.csv"), ra="RA", dec="DEC")

The source is spilled once at open() to a single memory-mapped Arrow file under the connection's scratch directory (removed on close()), adaptively partitioned by sky density, and from then on behaves like an ordinary (if coarse) catalog — usable as a crossmatch root or operand. A few rules:

  • ra= / dec= are required and never guessed (degrees, ICRS); NULL/NaN-coordinate rows are dropped with a warning.
  • acid.open(<file>) does not register a name, so the SQL escape hatch can't see it. To use a raw file by name in acid.sql.query(...), register it with acid.register_file(name, path, ra=…, dec=…) (or, on the CLI, acid query --open).
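The coordinate rule in the first bullet can be pictured as a filter over the input rows. This is a minimal sketch that assumes rows are dicts with numeric or None coordinates; drop_bad_coords is a hypothetical helper, and the warning text is made up for illustration.

```python
import math
import warnings


def drop_bad_coords(rows, ra="RA", dec="DEC"):
    """Keep rows whose ra/dec are non-null and not NaN; warn about the rest."""
    def ok(value):
        return value is not None and not math.isnan(float(value))

    kept = [r for r in rows if ok(r.get(ra)) and ok(r.get(dec))]
    dropped = len(rows) - len(kept)
    if dropped:
        warnings.warn(f"dropped {dropped} row(s) with NULL/NaN coordinates")
    return kept
```

The column names are passed explicitly rather than inferred, mirroring the rule that ra= and dec= are never guessed.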

See bring your own target list for the crossmatch workflow and the full set of accepted types.

Auto-detected catalog metadata

For each registered catalog, acid reads:

  • properties (or hats.properties) — for hats_col_ra, hats_col_dec, hats_col_healpix, hats_col_healpix_order, hats_margin_threshold, etc.
  • partition_info.csv — for the actual (Norder, Npix) partition list and the maximum Norder (the catalog's hpix_order).
  • A margin_cache/ subdirectory or *_margin* sibling — picked up automatically as the catalog's margin cache. The neighbor_margin_arcsec comes from the margin's own properties (hats_margin_threshold).
  • point_map.fits: a per-cell row-count map. acid requires one for every HATS catalog in a query, because it is how the planner sizes work tuples to your RAM budget (see Performance — RAM budget). Every acid output and every catalog from a current hats-import carries one; a missing map, or one that is only a 0/1 mask, raises a ValidationError that names the catalog. The same file doubles as the sky-density footprint that is auto-loaded when a query references the catalog by name in IN_MOC(<alias>, '<catalog_name>') or Catalog.in_region(<catalog_name>).
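To make the partition_info.csv step concrete, here is a minimal sketch of deriving the partition list and the maximum Norder with the standard library. It assumes the file has a header row with Norder and Npix columns; read_partitions is a hypothetical helper, not part of acid.

```python
import csv


def read_partitions(csv_path):
    """Return ([(Norder, Npix), ...], max Norder) from a partition_info.csv."""
    with open(csv_path, newline="") as fh:
        parts = [(int(row["Norder"]), int(row["Npix"])) for row in csv.DictReader(fh)]
    if not parts:
        raise ValueError(f"{csv_path} lists no partitions")
    # The deepest order present is what the text above calls hpix_order.
    return parts, max(order for order, _ in parts)
```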

You rarely need to override any of these. The most common reason to do so is a non-standard catalog layout (a hive directory, a catalog without canonical HATS properties, or a margin cache in an unusual location).

The YAML registry

catalogs:
  gaia_dr3:
    path: /data/hats/gaia_dr3

  twomass_psc:
    path: /data/hats/twomass_psc
    # Per-catalog overrides are optional. Any field you don't set
    # here is auto-detected from properties / partition_info.csv.
    # ra_col: ra
    # dec_col: decl     # 2MASS names its declination column "decl"
    # hpix_order: 5
    # neighbor_path: /scratch/twomass_margin_10arcsec
    # neighbor_margin_arcsec: 10.0

  my_targets:
    path: /scratch/my_targets
    # Hive-partitioned (not HATS): tell acid so it doesn't try to
    # read HATS metadata files.
    layout: hive
    ra_col: ra
    dec_col: dec
    hpix_order: 5
    partition_col: hpix

mocs:
  des_dr2_footprint: /data/footprints/des_dr2.fits
  delve_dr3:          /data/footprints/delve_dr3.fits

The top-level keys:

| Key | Required | What it means |
|---|---|---|
| catalogs: | yes | Map of <name>: {path: ..., ...}. The name is what you pass to acid.open(...). |
| mocs: | no | Map of <name>: <path to FITS MOC>. Registered alongside catalogs, available to IN_MOC(<alias>, '<name>') and .in_region(<name>). |

The per-catalog fields:

| Field | Default | What it means |
|---|---|---|
| path | required | Local path to the catalog directory. Can be a HATS catalog root or a CatalogCollection root (in which case the primary table is resolved from collection.properties). |
| layout | auto (hats if HATS metadata present, else hive) | hats or hive. Set explicitly only for hive-layout parquet directories. |
| ra_col | auto (hats_col_ra property) | Name of the RA column. |
| dec_col | auto (hats_col_dec property) | Name of the Dec column. |
| hpix_order | auto (max Norder from partition_info.csv) | The catalog's HEALPix order. Required for hive catalogs (no HATS metadata to derive it from). |
| partition_col | Npix (HATS) / hpix (hive) | The column / directory level holding the HEALPix index. |
| neighbor_path | auto (sibling *_margin* or margin_cache/ subdir) | Path to the margin cache for this catalog. See Margin caches. |
| neighbor_margin_arcsec | auto (from the margin cache's hats_margin_threshold) | The cache's recorded width in arcsec. This is what the analyzer compares against radius_arcsec in XMATCH(...). |
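One way to read the defaults in the per-catalog fields is as "explicit override wins, otherwise use the detected value". The sketch below models that precedence for a few fields. effective_fields is a hypothetical helper, props stands for the parsed properties file, and max_norder for the value derived from partition_info.csv; none of these names come from acid.

```python
from typing import Optional


def effective_fields(entry: dict, props: dict, max_norder: Optional[int]) -> dict:
    """Merge explicit per-catalog overrides over auto-detected values."""
    # Layout is auto-detected from the presence of HATS metadata.
    layout = entry.get("layout") or ("hats" if props else "hive")
    detected = {
        "layout": layout,
        "ra_col": props.get("hats_col_ra"),
        "dec_col": props.get("hats_col_dec"),
        "hpix_order": max_norder,
        "partition_col": "Npix" if layout == "hats" else "hpix",
    }
    # Only fields the user actually set replace the detected ones.
    overrides = {k: v for k, v in entry.items() if v is not None}
    return {**detected, **overrides}
```

For the twomass_psc entry with dec_col: decl uncommented, this would keep the detected ra_col and hpix_order and replace only dec_col.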

Adding catalogs at runtime

acid.register_catalog(name, **kwargs) accepts the same per-catalog fields as the YAML:

acid.init("/data/hats")

# A catalog not in /data/hats:
acid.register_catalog("offsite", path="/mnt/backups/another_catalog")

# Override the auto-detected margin path for an existing entry:
acid.register_catalog(
    "twomass_psc",
    path="/data/hats/twomass_psc",
    neighbor_path="/scratch/twomass_margin_10arcsec",
)

Re-registering an existing name silently replaces the old entry on the connection — no overwrite= flag needed. The new entry takes effect on the next acid.open(name).

acid.register_moc(name, source) registers a MOC at runtime; source is a FITS path, an in-memory mocpy.MOC, or an (N, 2) array of order-29 [lo, hi) integer ranges (the same shape acid uses internally).
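To illustrate the range form, a single HEALPix cell maps to a half-open interval of order-29 indices by a bit shift. The sketch below assumes the NESTED numbering scheme (the one the MOC standard uses); cell_to_range is a hypothetical helper. Whether register_moc expects the ranges sorted or merged is not stated here, so this only shows the arithmetic.

```python
MAX_ORDER = 29


def cell_to_range(order: int, ipix: int) -> tuple[int, int]:
    """Half-open [lo, hi) range of order-29 NESTED indices covered by one cell."""
    if not 0 <= order <= MAX_ORDER:
        raise ValueError("order must be in 0..29")
    if not 0 <= ipix < 12 * 4**order:
        raise ValueError("ipix out of range for this order")
    # Each step to a finer order splits a cell into 4 children (2 bits).
    shift = 2 * (MAX_ORDER - order)
    return ipix << shift, (ipix + 1) << shift


# Two adjacent order-5 cells as a list of [lo, hi) pairs.
ranges = [cell_to_range(5, 42), cell_to_range(5, 43)]
```

Passing a list like ranges through numpy.asarray(ranges, dtype="int64") would give the (N, 2) shape described above.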

Listing what's available

acid.init("/data/hats")
for cat in acid.list_catalogs():
    print(cat.name, cat.margins_arcsec)   # gaia_dr3 [10.0, 300.0], ...

list_catalogs() is the Python equivalent of the acid list CLI: it crawls the connection's roots — local directories, ssh:// hosts, and http(s):// mirrors alike — and returns one CatalogInfo row (name, margins_arcsec, root, shadowed) per catalog. A namespaced catalog surfaces as namespace/child, margin-cache siblings are attributed to their parent (not listed as catalogs), and a name found at more than one root is flagged shadowed on the later one (acid.open resolves first-wins). Catalogs registered explicitly (by YAML or register_catalog) are included too.
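The shadowing rule can be sketched as a single pass over (root, name) pairs in root search order. Listing is a hypothetical stand-in for illustration only, not acid's CatalogInfo, and flag_shadowed is likewise an assumed helper.

```python
from dataclasses import dataclass


@dataclass
class Listing:  # hypothetical stand-in, not acid's CatalogInfo
    name: str
    root: str
    shadowed: bool


def flag_shadowed(found: list[tuple[str, str]]) -> list[Listing]:
    """Flag later duplicates; `found` holds (root, name) in root search order."""
    seen = set()
    out = []
    for root, name in found:
        # The first occurrence is the one open() would resolve, so only
        # later occurrences of the same name are marked shadowed.
        out.append(Listing(name=name, root=root, shadowed=name in seen))
        seen.add(name)
    return out
```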

It is opt-in discovery: the crawl is O(roots × subdirs) and can be slow on remote roots. It does not read full per-catalog metadata — that happens lazily on acid.open(name), which is the point at which properties, partition_info.csv, and the margin sibling are inspected.

On a remote root or a slow filesystem, prefer to get the catalog names up front (from a YAML or your survey docs) and skip the walk; acid.open("gaia_dr3") works without list_catalogs() having been called.

Hive-layout catalogs

acid also reads catalogs that follow a hive-partitioned layout (directories like hpix=42/foo.parquet) but do not include the full HATS metadata. The price is that you must declare the partition column, the RA / Dec columns, and the HEALPix order in YAML:

catalogs:
  my_hive:
    path: /scratch/hive_export
    layout: hive
    partition_col: hpix
    ra_col: ra
    dec_col: dec
    hpix_order: 5
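A small pre-flight check can catch a hive entry that leaves one of these fields unset before it reaches the registry. This is a hedged sketch: missing_hive_fields is a hypothetical helper, and it treats partition_col as required, following the paragraph above, even though the field table lists hpix as its hive default.

```python
REQUIRED_HIVE_FIELDS = ("path", "partition_col", "ra_col", "dec_col", "hpix_order")


def missing_hive_fields(entry: dict) -> list[str]:
    """Return the required fields a hive-layout entry leaves unset."""
    if entry.get("layout") != "hive":
        return []  # HATS entries can auto-detect these from metadata
    return [f for f in REQUIRED_HIVE_FIELDS if entry.get(f) is None]
```

An empty list means the entry declares everything a hive layout needs; any returned names are the fields to add.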

Hive catalogs do not ship a margin cache or point_map.fits, which limits what they can do:

  • They cannot be the right side of an XMATCH: with no margin cache, the analyzer rejects the query with the "no neighbor_path configured" error.
  • Without a point_map.fits, they also can't be sized by the RAM-budget planner (which needs the row-count map). For an ad-hoc hive export, the simplest path is to open it as a virtual catalog instead — acid.open("/scratch/hive_export/...", ra=…, dec=…) reads a file directly and partitions it for you.

Build a HATS copy of your data (e.g. with hats-import) if you need a persistent catalog you can crossmatch into repeatedly.

See also

  • Connections — the Connection lifecycle and acid.open semantics.
  • Downloading catalogs — fetching a HATS catalog (with its margin cache) from data.lsdb.io over HTTP or SSH.
  • Margin caches — the neighbor_path and neighbor_margin_arcsec fields above, in depth.
  • Sky regions & footprints — the mocs: section above, in depth.
  • Errors — RegistryError — what the analyzer says when a catalog can't be found, a margin path is missing, or a duplicate-name registration is rejected.