
Troubleshooting

This page is the symptom-shaped index of common pitfalls. For each entry: the symptom as you'd see it, the cause, and the fix. The error-class reference is in reference/errors.md; this page picks up where that one leaves off — situations that don't surface as a clean error class, or where the error message is generic and you want a faster way to map it back to its cause.

Crossmatching

"My LEFT JOIN with XMATCH returns nothing — every row's right-side columns are NULL."

Symptom: every row is unmatched, even sources you know have counterparts.

Cause: almost always one of:

  1. Epoch mismatch. ACID treats every catalog's stored RA/Dec as J2000 / ICRS. Matching a J2016 Gaia catalog against a J2000 survey at 1″ silently loses every high-proper-motion source. See crossmatch §1.
  2. Radius too small for the actual separations. Bump the radius (within the margin cache's recorded width — the analyzer rejects anything wider, see below) and run again.
  3. maxmatch=1 (nearest) combined with a pre-filter on the right operand. A pre-filter on the right (a.crossmatch(b.where(...), ...)) makes the matcher only see surviving rows; if the nearest match in the catalog is filtered out, the row returns unmatched even if a farther survivor would qualify.

Fix: propagate to J2000 with Astropy (see the crossmatch guide), widen the radius (after rebuilding the margin cache, below), or move the right-side filter into a post-crossmatch where so the matcher gets a chance to find every candidate first.
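It helps to put numbers on the epoch drift. The sketch below is a rough linear stand-in for the Astropy route, where SkyCoord.apply_space_motion is the rigorous tool. It makes three assumptions: proper motions are in mas/yr, pmra already includes the cos(dec) factor (the Gaia convention), and parallax and radial velocity are ignored. It also degrades near the poles. The function names are illustrative and are not part of acid.

```python
import math

MAS_PER_DEG = 3_600_000.0  # milliarcseconds per degree


def drift_arcsec(pmra_mas_yr, pmdec_mas_yr, dt_years):
    """Total on-sky displacement (arcsec) accumulated over dt_years."""
    return math.hypot(pmra_mas_yr, pmdec_mas_yr) * abs(dt_years) / 1000.0


def propagate_linear(ra_deg, dec_deg, pmra_mas_yr, pmdec_mas_yr, from_epoch, to_epoch):
    """Shift a position linearly from from_epoch to to_epoch (Julian years).

    Assumes pmra includes cos(dec), so that factor is divided back out for
    the RA step. Small-angle approximation only: adequate for decade-scale
    epoch gaps away from the poles, not a full space-motion propagation.
    """
    dt = to_epoch - from_epoch
    dec_new = dec_deg + pmdec_mas_yr * dt / MAS_PER_DEG
    ra_new = ra_deg + pmra_mas_yr * dt / (MAS_PER_DEG * math.cos(math.radians(dec_deg)))
    return ra_new % 360.0, dec_new


# A star moving at 100 mas/yr drifts 1.6" between J2000 and J2016,
# which puts it outside a 1" match radius.
print(drift_arcsec(100.0, 0.0, 16.0))
```

The threshold is simply radius divided by the epoch gap. For a 1″ radius over 16 years, any source faster than about 62.5 mas/yr is at risk of going unmatched.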

ValidationError: XMATCH radius_arcsec=... exceeds ...'s neighbor_margin_arcsec=...

Symptom: every crossmatch above some radius is rejected.

Cause: the right catalog's margin cache is narrower than the radius you're asking for. The cache holds boundary rows out to neighbor_margin_arcsec; a wider radius would silently miss matches that straddle a partition boundary.

Fix: rebuild the margin cache wider, or shrink the XMATCH radius:

acid hats build-margin /data/<catalog> --margin-arcsec 10.0

See Margin caches for the canonical reference.
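The analyzer's rule is easiest to see in a toy one-dimensional model. Here two partitions meet at a boundary, and each partition matches only its own right-side rows plus the neighbour's rows within margin of that boundary. Real HATS margins live on the sphere, and every name below is hypothetical. The sketch only shows why a radius wider than the margin misses pairs that straddle a boundary.

```python
# Toy 1-D model: two partitions split at `boundary`. Each left row is matched
# only against right rows stored in its own partition, plus the neighbour's
# rows that sit within `margin` of the boundary (the "margin cache").


def partition_of(x, boundary):
    return 0 if x < boundary else 1


def partitioned_match(left, right, boundary, margin, radius):
    """Per-partition matching, as a partitioned engine would do it."""
    result = {}
    for lx in left:
        home = partition_of(lx, boundary)
        local = [rx for rx in right if partition_of(rx, boundary) == home]
        cached = [rx for rx in right
                  if partition_of(rx, boundary) != home and abs(rx - boundary) <= margin]
        result[lx] = sorted(rx for rx in local + cached if abs(rx - lx) <= radius)
    return result


def brute_force_match(left, right, radius):
    """Ground truth: compare every pair, ignoring partitions."""
    return {lx: sorted(rx for rx in right if abs(rx - lx) <= radius) for lx in left}


left, right = [9.8], [10.6]  # 0.8 apart, on opposite sides of boundary 10.0
print(brute_force_match(left, right, radius=1.0))                    # finds 10.6
print(partitioned_match(left, right, 10.0, margin=0.5, radius=1.0))  # misses it
```

When radius ≤ margin, any cross-boundary partner is at most radius from the boundary. It is therefore always in the cache, and the partitioned result equals the brute-force one.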

ValidationError: ... has no neighbor_path (margin cache) configured

Symptom: a crossmatch against a particular catalog won't run at all.

Cause: the catalog has no margin cache attached. Catalogs imported with hats-import ship without one by default. A downloaded catalog also lacks one if the source was published without a margin cache or if --skip-margin was used.

Fix: build one once:

acid hats build-margin /data/<catalog> --margin-arcsec 10.0

Then re-run. See Margin caches.

Can I crossmatch my own CSV / target list?

Yes, directly. acid.open(...) accepts a raw data file or an in-memory frame as a virtual catalog, so no offline HATS import is needed:

byo_targets.py
import acid
import astropy.units as u

acid.init("catalogs.yaml")

# targets.csv has columns RA, DEC (degrees, ICRS).
targets = acid.open("targets.csv", ra="RA", dec="DEC")
near = targets.crossmatch(acid.open("gaia_dr3"), radius=1 * u.arcsec)
tbl = near.to_astropy()

Accepted files: .parquet, .csv, .tsv, .fits, .arrow / .feather, VOTable. Accepted in-memory frames: NumPy structured array, pandas, polars, pyarrow, Astropy Table. The ra= / dec= column names are required (never guessed). Rows with NULL/NaN coordinates are dropped with a warning. The full how-to (operand vs. root, the SQL/CLI register_file / --open surface) is in the bring your own target list guide.

You still need a margin cache on the right-side catalog (the one you're matching into) at a radius ≥ your match radius; the virtual target catalog on the left does not.
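acid drops rows with NULL/NaN coordinates itself and warns about them. If you would rather catch bad rows before opening the file, for example to fix them at the source, a stdlib-only pre-check could look like the sketch below. The column names default to the RA / DEC used in the example above. The degree-range checks and the function name are assumptions of this sketch, not acid behavior.

```python
import csv
import io
import math
import warnings


def clean_targets(fh, ra="RA", dec="DEC"):
    """Return rows of a CSV target list whose coordinates are usable.

    Rows with empty, non-numeric, NaN/inf, or out-of-range RA/Dec (degrees)
    are dropped with a warning. A missing column is an error, mirroring the
    rule that ra= / dec= are never guessed.
    """
    reader = csv.DictReader(fh)
    missing = {ra, dec} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing coordinate column(s): {sorted(missing)}")

    kept, dropped = [], 0
    for row in reader:
        try:
            ra_deg, dec_deg = float(row[ra]), float(row[dec])
        except (TypeError, ValueError):  # empty cell, text, or short row (None)
            dropped += 1
            continue
        if not (math.isfinite(ra_deg) and math.isfinite(dec_deg)):
            dropped += 1
        elif not (0.0 <= ra_deg <= 360.0 and -90.0 <= dec_deg <= 90.0):
            dropped += 1
        else:
            kept.append(row)
    if dropped:
        warnings.warn(f"dropped {dropped} row(s) with missing or invalid coordinates")
    return kept


sample = io.StringIO("RA,DEC,name\n10.5,-5.0,a\n,3.0,b\nnan,1.0,c\n200.0,95.0,d\n")
good = clean_targets(sample)
```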

When hats-import is still the right tool

Virtual catalogs are for target lists and ad-hoc inputs — opened fresh each session, partitioned coarsely. To publish a real, persistently partitioned HATS catalog at survey scale (one others register and query repeatedly), build it with the official hats-import tool, then point acid at the resulting tree with acid.register_catalog(name, path=...). acid reads it like any other HATS catalog.

ValidationError: catalog '<name>' has no point_map.fits

Symptom: a query against a particular catalog is rejected at compile time with a message about a missing point_map.fits.

Cause: every HATS catalog in a query needs a point_map.fits holding per-cell row counts. ACID uses it to size work tuples to fit your RAM budget (see Performance — RAM budget). Every acid output and every catalog from a current hats-import carries one; an older or hand-built catalog may not, or may carry a 0/1 footprint mask instead of real counts.
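Per-cell row counts are what make RAM-aware sizing possible. A 0/1 mask gives every covered cell the same weight. The sketch below is a hypothetical illustration, not acid's actual scheduler. It treats the point map as a flat list of per-cell counts, flags a map that looks like a footprint mask, and greedily groups cells into work units that stay under a row budget.

```python
def looks_like_footprint_mask(counts):
    """True if every cell holds 0 or 1: coverage only, no row counts."""
    return all(c in (0, 1) for c in counts)


def pack_cells(counts, row_budget):
    """Greedily group non-empty cells (in index order) into work units whose
    summed row count stays within row_budget. A single cell bigger than the
    budget still gets a unit of its own rather than being split."""
    units, current, current_rows = [], [], 0
    for cell, n in enumerate(counts):
        if n == 0:
            continue  # empty cells cost nothing and are skipped
        if current and current_rows + n > row_budget:
            units.append(current)
            current, current_rows = [], 0
        current.append(cell)
        current_rows += n
    if current:
        units.append(current)
    return units


point_map = [5, 0, 3, 4, 10, 1]  # hypothetical per-cell row counts
print(pack_cells(point_map, row_budget=8))
```

Fed a 0/1 mask, a packer like this would treat a dense cell and a sparse cell as equally expensive. That is why a mask cannot stand in for real counts.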

Fix, depending on where the catalog came from:

  • Downloaded with acid download: the download writes a fresh point_map.fits, so if the tree is incomplete, re-run the same command (it resumes).
  • Written by acid (Catalog.save / acid query --output hats) — it always carries one; this error shouldn't occur, so file an issue if it does.
  • Built by an older hats-import — re-import with a current version, which writes a row-count point_map.fits.
  • An ad-hoc file you don't want to re-import — open it as a virtual catalog instead (acid.open(path, ra=…, dec=…)), which partitions it for you without needing a point_map.fits.

The message names the catalog and its path so you know which one to fix.

Empty / unexpected results

"My query returns zero rows but I expected matches."

Symptom: a query that should produce data returns empty.

Cause:

  1. A debug cone with no coverage. If you execute a query inside a with acid.in_cone(...): block, partitions outside the cone are pruned — so a cone over an empty patch returns nothing.
  2. A WHERE that's tighter than you think. IN_MOC(a, 'name') restricts to a MOC; an unfamiliar MOC name (or a misspelled one) that doesn't auto-resolve as a catalog footprint causes a ValidationError, but a valid MOC over the wrong patch silently returns nothing.
  3. A column-subset download that dropped what you're filtering on. acid download --columns ... skips columns you didn't ask for; a later WHERE on a missing column raises, but a WHERE that always evaluates to NULL on the columns you did keep silently returns nothing.

Fix: run the same query inside a wider cone (or no cone) with a limit(10) first to verify it produces anything; then narrow back down. The debug-small-run-big guide covers the pattern.
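Cause 3 comes from standard SQL three-valued logic. A comparison with NULL is neither true nor false, so WHERE drops the row, and negating the condition drops it too. The sketch below uses Python's built-in sqlite3 as a stand-in engine, so acid's SQL layer is not involved. The table and column names are illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src (source_id INTEGER, mag REAL)")
# Every mag is NULL, as if the column had never been populated.
con.executemany("INSERT INTO src VALUES (?, ?)", [(1, None), (2, None), (3, None)])


def count(where="1"):
    # Trusted demo input only: the WHERE clause is interpolated directly.
    return con.execute(f"SELECT COUNT(*) FROM src WHERE {where}").fetchone()[0]


n_all = count()                     # every row
n_bright = count("mag < 18")        # NULL < 18 is NULL, so no row passes
n_faint = count("NOT (mag < 18)")   # NOT NULL is still NULL: no row passes either
n_null = count("mag IS NULL")       # the explicit check is the quickest diagnostic
```

If a filter and its negation both return zero rows, the filtered column is probably NULL throughout.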

Performance & memory

"My job OOM'd in the reduce step."

Symptom: the query gets through phase 1, then dies with an ExecutionError after the process RSS spikes at the end.

Cause: the phase-2 reduce loaded too much into memory at once. Phase 1 spills past inmem_row_limit automatically (default 50 M rows), so the partials are usually on disk by reduce time. However, calling r.to_pandas() or r.to_polars() then loads the whole spill into a single in-memory DataFrame.

Fix: in order of preference:

  1. Stream with r.batches() instead of r.to_pandas(). The spill is disk-backed; streaming reads it lazily. See Results — Streaming.
  2. Lower inmem_row_limit so phase 1 spills earlier and the reduce stays disk-backed end-to-end. See Performance — Memory & spill.
  3. If step 2 isn't enough, reduce workers to lower concurrent RSS.

See Errors — ExecutionError for the full remediation order.
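Streaming helps because peak memory is bounded by one batch rather than by the whole result. Below is a minimal pure-Python sketch of the pattern. A hypothetical generator stands in for a disk-backed result, and nothing here assumes what batch type the real r.batches() yields.

```python
def fake_batches(n_rows, batch_size):
    """Hypothetical stand-in for r.batches(): yields one batch at a time
    instead of materializing every row up front."""
    for start in range(0, n_rows, batch_size):
        yield list(range(start, min(start + batch_size, n_rows)))


def streaming_mean(batches):
    """Fold batches into a running total; only one batch is held at a time."""
    total, count, largest_batch = 0, 0, 0
    for batch in batches:
        total += sum(batch)
        count += len(batch)
        largest_batch = max(largest_batch, len(batch))
    if count == 0:
        return None, 0  # empty result: no mean to report
    return total / count, largest_batch


mean, peak_rows = streaming_mean(fake_batches(n_rows=1_000_000, batch_size=65_536))
```

The largest batch, not the million-row total, sets the memory high-water mark. A to_pandas()-style call would hold all rows at once.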

"My job runs but is much slower than I expect on a many-core node."

Symptom: wall time doesn't scale with workers.

Cause: usually one of:

  1. You're inside a cgroup with a CPU quota smaller than the host's core count. os.cpu_count() reads the host (e.g. 128), but acid uses cpu_cap(), which reads the actual quota (e.g. 4). workers="auto" honors the quota; explicit workers=64 does not and oversubscribes.
  2. Process-pool churn. Spinning up a new Connection per query costs ~2.5 s of import time. Open one connection and reuse it.
  3. Allocator contention. The default _RJEM_MALLOC_CONF=dirty_decay_ms:-1,muzzy_decay_ms:-1 removes madvise contention at the cost of roughly 20 % more RSS. If it has been overridden (e.g. to recover RAM), expect wall time to go up.

Fix: profile with ACID_PROFILE=1 acid query "..."; reuse one Connection; verify cpu_cap() matches the cgroup quota. See Performance & parallelism for the full walkthrough.
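To see what a quota-aware count looks like, here is a rough approximation. It is not acid's cpu_cap(), whose exact logic this page doesn't describe. It assumes cgroup v2, where the limit is stored in cpu.max as a quota and a period in microseconds, or the word max when there is no quota. cgroup v1 exposes cpu.cfs_quota_us and cpu.cfs_period_us instead, and this sketch does not handle them.

```python
import math
import os


def cpus_from_cpu_max(text, host_cpus):
    """Effective CPUs from a cgroup v2 cpu.max line: "<quota> <period>" or "max <period>"."""
    quota, period = text.split()
    if quota == "max":
        return host_cpus  # no quota: the host (or affinity mask) is the limit
    # Round down so a 1.5-CPU quota isn't oversubscribed; never go below 1.
    return max(1, min(host_cpus, math.floor(int(quota) / int(period))))


def effective_cpus(cpu_max_path="/sys/fs/cgroup/cpu.max"):
    """os.cpu_count() reports the host; this also honors affinity and a cgroup v2 quota."""
    if hasattr(os, "sched_getaffinity"):
        host = len(os.sched_getaffinity(0))
    else:
        host = os.cpu_count() or 1
    try:
        with open(cpu_max_path) as fh:
            return cpus_from_cpu_max(fh.read(), host)
    except (OSError, ValueError):
        return host  # no cgroup v2 file (macOS, cgroup v1, ...) or an unexpected format
```

If effective_cpus() reports 4 while os.cpu_count() reports 128, an explicit workers=64 oversubscribes by 16×.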

"Pool startup takes ages before the first query runs."

Symptom: acid sits idle for several seconds before any work appears in the progress bar.

Cause: worker-pool startup. The forkserver preload is on by default and pre-imports numpy/polars/pyarrow/scipy/cdshealpix; that takes ~2.5 s once. With ACID_PREWARM on (default), all workers come up together behind a barrier before the first query.

Fix: for most users, the real answer is to reuse a single Connection across queries. Startup is paid once and amortizes across everything that follows. Only if you're running many short one-off queries, where the ~2.5 s bootstrap dominates the actual work, should you reach for ACID_FORKSERVER_PRELOAD=0 (which trades startup latency for per-worker import cost). See Performance — Worker-startup knobs.
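A back-of-the-envelope model shows why reuse wins. It takes the ~2.5 s bootstrap from the text and treats per-query work as a fixed cost. That fixed cost is a simplifying assumption, since real queries vary. The function is illustrative only.

```python
def total_wall_time(n_queries, per_query_s, startup_s=2.5, reuse=True):
    """Wall time for n_queries, paying the pool bootstrap once (reuse) or every time."""
    if reuse:
        return startup_s + n_queries * per_query_s
    return n_queries * (startup_s + per_query_s)


shared = total_wall_time(100, per_query_s=0.5)               # 52.5 s
fresh = total_wall_time(100, per_query_s=0.5, reuse=False)   # 300.0 s
```

With a fresh pool per query, 100 half-second queries spend more than 80 % of their wall time on startup.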

Downloads

"acid download failed partway through. Is the directory usable?"

Symptom: the download exited non-zero. There's a partial tree on disk.

Cause / fix: acid download is designed to either complete or fail loudly. A half-downloaded HATS tree looks structurally valid but silently misses data, so exiting zero on a partial download would be the worst possible failure mode. The behavior:

  • The first worker exception aborts the download; queued transfers are cancelled and the CLI exits non-zero.
  • Re-running the same command resumes: files already on disk are skipped. Retry is cheap.

Just re-run the command. If the underlying network is unreliable, raise --timeout and re-run. If SSH retries are exhausted, fix the link or switch to HTTP. See Downloading catalogs.
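Resuming by skipping existing files is only safe if a file on disk is always complete. Here is a minimal sketch of that contract. Two design choices are assumptions of the sketch, not statements about acid's internals: each file is written under a temporary name and renamed into place, and any fetch error aborts the whole run. fetch is a hypothetical callable that returns bytes.

```python
import os


def resumable_download(names, fetch, dest):
    """Fetch each name into dest, skipping files that already exist.

    The first fetch error propagates and aborts the run (fail loudly), and
    every file is written to "<name>.part" and atomically renamed, so an
    interrupted run never leaves a truncated file under its final name.
    """
    fetched = []
    for name in names:
        target = os.path.join(dest, name)
        if os.path.exists(target):
            continue  # completed by an earlier run: retry is cheap
        data = fetch(name)  # any exception aborts the remaining transfers
        os.makedirs(os.path.dirname(target) or dest, exist_ok=True)
        tmp = target + ".part"
        with open(tmp, "wb") as fh:
            fh.write(data)
        os.replace(tmp, target)  # atomic on POSIX: absent or complete, never partial
        fetched.append(name)
    return fetched
```

The second run in a flaky-network scenario fetches only what the first run didn't finish.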

"acid download --estimate doesn't print a precise size."

Symptom: the estimate can be off by an order of magnitude, or is marked "approximate".

Cause: without --prefetch-metadata, the estimate is derived from partition_info.csv and per-partition averages; it doesn't read the catalog's _metadata for exact byte ranges.

Fix: pass --prefetch-metadata to fetch _metadata once and get precise figures. Note _metadata can be hundreds of MB on wide catalogs (a Rubin object catalog is ~950 MB), so the prefetch itself isn't free.

"I downloaded the catalog but the analyzer says it has no margin cache."

Symptom: ValidationError: ... has no neighbor_path (margin cache) configured, after a fresh acid download.

Cause: either the source catalog had no margin cache (some catalogs at data.lsdb.io are published without one), or you passed --skip-margin during the download.

Fix: build a local margin cache:

acid hats build-margin /data/<catalog> --margin-arcsec 10.0

See Margin caches.

Finding & resolving catalogs

"A catalog name won't resolve — catalog '<name>' not found."

Symptom: acid query / acid.open(...), acid download, or acid inspect rejects a bare catalog name. The error carries a per-root trail and a hint:

error[registry]:
  > catalog 'gaia_dr_3' not found; searched roots […] and registered names
  = help: did you mean 'gaia_dr3'?
  = note: run `acid download gaia_dr_3` to fetch it, or `acid search` to find it
          or pass an absolute path / URL, register via add_catalog(...), or use a YAML config

Cause: the name didn't match any registered catalog, any catalog on the searched roots, or — for acid download — any catalog on the download path. Most often it's a typo, a catalog that lives under a different root than the one you're searching, or a catalog you haven't downloaded yet.

Fix: the error tells you which it is. Work through it in order:

  1. Read the per-root trail. Each root that was searched is listed with its outcome — no match, unreachable: <reason>, or malformed collection. This tells you where acid looked and what it found at each place.

  2. Take the did you mean '<closest>'? hint seriously. It's drawn from catalogs acid already knows about (registered names, catalogs on your local roots, and cached download listings), so a suggestion is a real, reachable catalog, usually your typo corrected (gaia_dr_3 → gaia_dr3, two_ass → two_mass).

  3. A root reported unreachable is not the same as "catalog absent". If the trail shows an unreachable: <reason> line and the note says "a root was unreachable — the catalog may exist there but couldn't be checked," a transport failure (an SSH host that's down, a connection refused / timed out, a 5xx from a mirror) stopped acid from checking that root. The catalog may well be there. Fix or re-try the connection (check the VPN / SSH host, raise --timeout, try again) rather than assuming the catalog is gone. ACID no longer treats a probe that errored as "not found" — it surfaces the failure so you don't chase a phantom typo when the real problem is the network.

  4. Discover and fetch. For a name you haven't downloaded, the next step depends on the command:

    • acid query / acid.open point you at acid download <name> to fetch the catalog first. If the closest match is itself available to download, the error says so directly — "<name> is available to download — run acid download <name>."
    • acid download / acid inspect point you at acid search <name> to find a downloadable catalog, or at passing an explicit URL / path.
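Steps 1 to 3 come down to one distinction: "searched and absent" versus "could not search". The sketch below models that distinction with a hypothetical per-root trail in which transport errors are kept separate from misses. It pairs the trail with a closest-name hint built on difflib. difflib's similarity ratio is this sketch's choice, and acid's own ranking is not documented here.

```python
import difflib


def resolve_name(name, roots):
    """Probe each (label, probe) root in order.

    probe(name) returns True/False, or raises OSError (ConnectionError,
    TimeoutError, ...) when the root can't be checked. Returns the label of
    the first root holding the catalog (or None) plus a per-root trail.
    """
    trail = []
    for label, probe in roots:
        try:
            if probe(name):
                trail.append((label, "found"))
                return label, trail
            trail.append((label, "no match"))
        except OSError as exc:
            # A transport failure proves nothing about absence: record it
            # as unreachable instead of folding it into "not found".
            trail.append((label, f"unreachable: {exc}"))
    return None, trail


def did_you_mean(name, known_names):
    """Closest already-known name by difflib similarity, or None."""
    hits = difflib.get_close_matches(name, known_names, n=1)
    return hits[0] if hits else None


def ssh_down(_name):
    raise ConnectionRefusedError("connection refused")


local = {"gaia_dr3", "two_mass"}
found, trail = resolve_name("gaia_dr_3", [("local", local.__contains__), ("ssh", ssh_down)])
hint = did_you_mean("gaia_dr_3", sorted(local))
```

Here the lookup fails for two different reasons. The local root genuinely lacks the name, so the hint fixes the typo. The SSH root was never checked, so a retry might still find the catalog there.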

Running acid search once makes future suggestions smarter

The did you mean hint never triggers a fresh network crawl — it reads only what's already cached locally (the ~1-hour acid search cache) plus a cheap walk of your local roots. So the first time you look for a catalog on a remote mirror, the suggestion list may be empty or thin. Run acid search once to populate the cache, and subsequent not-found errors can suggest the right remote name.
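The "read only what's cached" behavior amounts to a time-to-live check. The following is a hypothetical sketch of such a freshness test. The ~1-hour window comes from the text above; the cache path and function name are invented for illustration, because this page doesn't say where acid stores its search cache.

```python
import os
import time

SEARCH_CACHE_TTL_S = 3600  # "~1 hour", per the text above


def cache_is_fresh(path, ttl_s=SEARCH_CACHE_TTL_S, now=None):
    """True if the file at path exists and was written less than ttl_s ago."""
    try:
        age = (time.time() if now is None else now) - os.path.getmtime(path)
    except OSError:
        return False  # no cache yet: suggestions can only come from local roots
    return age < ttl_s
```

A missing or stale cache gives the hint nothing remote to draw on, which is why running a search once improves later suggestions.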

Notebook / Connection lifecycle

ConnectionClosedError even though I never called close()

Symptom: a Catalog handle raises ConnectionClosedError partway through a notebook.

Cause: the underlying Connection was torn down — usually because you called acid.shutdown() (or built an explicit acid.Connection(...) as a context manager and let its with block exit) while a Catalog handle from it was still in scope.

Fix: for the common case, just keep using the module-level API (acid.open(...), acid.sql.query(...)) — it lazy-reopens a default Connection on the next call, so re-running the cell that built the Catalog revives it. Only call acid.shutdown() when you actually want to drop the pool. If you opted into an explicit acid.Connection for isolation, keep its with block open for as long as you use its catalogs. See Connections.
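The lifecycle above can be shown with a toy model. Every class and function here is hypothetical and is not acid's API. A handle keeps a reference to the connection it was built from, while the module-level entry point lazily opens a fresh default connection whenever the old one has gone away.

```python
class ConnectionClosedError(RuntimeError):
    pass


class Connection:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class Handle:
    """Stands in for a Catalog: bound to the connection that created it."""

    def __init__(self, conn):
        self._conn = conn

    def collect(self):
        if self._conn.closed:
            raise ConnectionClosedError("connection was shut down")
        return "rows"


_default = None


def _get_default():
    global _default
    if _default is None or _default.closed:
        _default = Connection()  # lazy (re)open on the next module-level call
    return _default


def open_catalog():
    return Handle(_get_default())


def shutdown():
    global _default
    if _default is not None:
        _default.close()
    _default = None
```

The old handle stays dead after shutdown(). Re-running the cell that built it goes through open_catalog() again, which picks up a new default connection.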

Platform

"Does acid run on Windows?"

No. acid runs on Linux and macOS only. It relies on POSIX primitives (os.fork, fcntl, the forkserver start method) and, on Linux, cgroup files, none of which have an equivalent on Windows. Use WSL2.
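To confirm whether an interpreter (native Windows, WSL2, or a container) offers the primitives listed above, you can probe for them directly. This is a hypothetical standard-library check that acid itself does not run. It doesn't look for cgroup files.

```python
import multiprocessing
import os
import sys


def posix_preflight():
    """Report which of the primitives named above this interpreter offers."""
    try:
        import fcntl  # noqa: F401  (POSIX-only module; ImportError on Windows)
        has_fcntl = True
    except ImportError:
        has_fcntl = False
    return {
        "platform": sys.platform,
        "os.fork": hasattr(os, "fork"),
        "fcntl": has_fcntl,
        "forkserver": "forkserver" in multiprocessing.get_all_start_methods(),
    }


print(posix_preflight())
```

On native Windows, the last three entries come back False. Under WSL2, all three are True.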

"My LSDB / hats.read_hats(...) import won't open an acid-written catalog."

Symptom: an output written via Catalog.save(...) or acid query --output looks like a HATS tree but external tools complain.

Cause: rare. HATS spec compliance is part of ACID's test suite (tests/test_gaia_hats.py::test_write_produces_valid_hats covers the round-trip), but bugs happen, especially on edge cases (empty partitions, unusual schema types).

Fix: check that the tree includes properties, partition_info.csv, and a populated dataset/ directory. If those all look right, open an issue on GitHub with the directory listing and the consuming tool's error.
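The layout check can be scripted before filing an issue. Below is a hypothetical helper that looks only for structure. It assumes partitions are Parquet files somewhere under dataset/, as in the usual HATS Norder=/Dir=/Npix= layout, and it doesn't validate file contents or metadata.

```python
from pathlib import Path


def check_hats_tree(root):
    """Return a list of problems; an empty list means the basic layout is present."""
    root = Path(root)
    problems = []
    for name in ("properties", "partition_info.csv"):
        if not (root / name).is_file():
            problems.append(f"missing {name}")
    dataset = root / "dataset"
    if not dataset.is_dir():
        problems.append("missing dataset/ directory")
    elif not any(dataset.rglob("*.parquet")):
        problems.append("dataset/ contains no .parquet files")
    return problems
```

If this returns an empty list and the consuming tool still fails, the problem is in the file contents. In that case, attach both the listing and the tool's error to the issue.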