Troubleshooting¶
This page is the symptom-shaped index of common pitfalls. For each
entry: the symptom as you'd see it, the cause, and the
fix. The error-class reference is in
reference/errors.md; this page picks up where
that one leaves off — situations that don't surface as a clean error
class, or where the error message is generic and you want a faster way
to map it back to its cause.
Crossmatching¶
"My LEFT JOIN with XMATCH returns nothing — every row's right-side columns are NULL."¶
Symptom: every row is unmatched, even sources you know have counterparts.
Cause: almost always one of:
- Epoch mismatch. ACID treats every catalog's stored RA/Dec as J2000 / ICRS. Matching a J2016 Gaia catalog against a J2000 survey at 1″ silently loses every high-proper-motion source. See crossmatch §1.
- Radius too small for the actual separations. Bump the radius (within the margin cache's recorded width — the analyzer rejects anything wider, see below) and run again.
- maxmatch=1 (nearest) combined with a pre-filter on the right operand. A pre-filter on the right (a.crossmatch(b.where(...), ...)) makes the matcher only see surviving rows; if the nearest match in the catalog is filtered out, the row returns unmatched even if a farther survivor would qualify.
Fix: propagate to J2000 with Astropy
(see the crossmatch guide),
widen the radius (after rebuilding the margin cache, below), or move
the right-side filter into a post-crossmatch where so the matcher
gets a chance to find every candidate first.
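To gauge whether an epoch mismatch could explain missing matches, it helps to estimate how far a source drifts between the two epochs. The sketch below is not part of ACID: the function names are hypothetical, proper motions are assumed to follow the Gaia convention (mas/yr, with pmra already including the cos(dec) factor), and the propagation is first-order only (no parallax or radial velocity, and unreliable near the poles). For real catalogs, Astropy's SkyCoord.apply_space_motion is the rigorous route.

```python
import math

MAS_PER_DEG = 3_600_000.0  # milliarcseconds per degree


def displacement_arcsec(pmra_masyr, pmdec_masyr, years):
    """Total on-sky shift (arcsec) accumulated over `years` of proper motion."""
    return math.hypot(pmra_masyr, pmdec_masyr) * abs(years) / 1000.0


def propagate_first_order(ra_deg, dec_deg, pmra_masyr, pmdec_masyr, years):
    """Linear shift of (ra, dec); pmra is assumed to include cos(dec)."""
    new_dec = dec_deg + pmdec_masyr * years / MAS_PER_DEG
    cos_dec = math.cos(math.radians(dec_deg))
    new_ra = ra_deg + pmra_masyr * years / (MAS_PER_DEG * cos_dec)
    return new_ra % 360.0, new_dec


# A hypothetical source moving 30 / 40 mas/yr, taken from J2016.0 back to J2000.0.
shift = displacement_arcsec(30.0, 40.0, years=-16.0)
print(f"shift over 16 yr: {shift:.2f} arcsec")

# Slowest proper motion that escapes a 1 arcsec match radius over 16 years.
threshold_masyr = 1.0 / 16.0 * 1000.0
print(f"threshold: {threshold_masyr:.1f} mas/yr")
```

Sources moving faster than the threshold cannot match at 1″ without propagation, whatever the radius of the margin cache.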
ValidationError: XMATCH radius_arcsec=... exceeds ...'s neighbor_margin_arcsec=...¶
Symptom: every crossmatch above some radius is rejected.
Cause: the right catalog's margin cache is narrower than the
radius you're asking for. The cache holds boundary rows out to
neighbor_margin_arcsec; a wider radius would silently miss matches
that straddle a partition boundary.
Fix: rebuild the margin cache wider, or shrink the XMATCH radius.
See Margin caches for the canonical reference.
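Why a radius wider than the margin loses matches can be seen in a toy one-dimensional model. This is an illustrative sketch only, not ACID's matcher: the function name, the single boundary, and the unitless coordinates are all assumptions made for the example.

```python
def match_left_partition(left, right, boundary, margin, radius):
    """Nearest-neighbor match for left rows in the partition x < boundary.

    The worker only sees right rows inside its own partition plus the
    margin cache: right rows in [boundary, boundary + margin).
    """
    visible = [x for x in right if x < boundary + margin]
    result = {}
    for p in left:
        if p >= boundary:
            continue  # belongs to another partition's worker
        candidates = [x for x in visible if abs(x - p) <= radius]
        result[p] = min(candidates, key=lambda x: abs(x - p)) if candidates else None
    return result


left_rows = [-1.0]   # one source just inside the partition
right_rows = [3.0]   # its counterpart, 4 units away across the boundary

# Radius wider than the margin: the counterpart is invisible, so the match is lost.
narrow = match_left_partition(left_rows, right_rows, boundary=0.0, margin=2.0, radius=5.0)
# Margin rebuilt to cover the radius: the match is found.
wide = match_left_partition(left_rows, right_rows, boundary=0.0, margin=5.0, radius=5.0)
print(narrow, wide)
```

Rejecting the query up front, as the analyzer does, is what keeps the first case from returning a silently incomplete result.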
ValidationError: ... has no neighbor_path (margin cache) configured¶
Symptom: a crossmatch against a particular catalog won't run at all.
Cause: the catalog has no margin cache attached. Catalogs imported
with hats-import ship without one by default; some downloads also
omit one if --skip-margin was used.
Fix: build a margin cache once, then re-run. See Margin caches.
Can I crossmatch my own CSV / target list?¶
Yes, directly. acid.open(...) accepts a raw data file or an
in-memory frame as a virtual catalog — no offline HATS import:
import acid
import astropy.units as u
acid.init("catalogs.yaml")
# targets.csv has columns RA, DEC (degrees, ICRS).
targets = acid.open("targets.csv", ra="RA", dec="DEC")
near = targets.crossmatch(acid.open("gaia_dr3"), radius=1 * u.arcsec)
tbl = near.to_astropy()
Accepted files: .parquet, .csv, .tsv, .fits, .arrow /
.feather, VOTable. Accepted in-memory frames: NumPy structured array,
pandas, polars, pyarrow, Astropy Table. The ra= / dec= column
names are required (never guessed); NULL/NaN-coordinate rows are
dropped with a warning. The full how-to (operand vs. root, the SQL/CLI
register_file / --open surface) is on
bring your own target list.
You still need a margin cache on the right-side catalog (the one you're matching into) at a radius ≥ your match radius; the virtual target catalog on the left does not.
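Because rows with NULL/NaN coordinates are dropped, it can be worth counting them before opening the file. The helper below is a stdlib-only sketch and not an ACID function; its name is hypothetical, and it treats empty cells, non-numeric text, and NaN as bad, which may differ from what ACID itself considers NULL.

```python
import csv
import io
import math


def count_bad_coordinates(csv_file, ra="RA", dec="DEC"):
    """Return (total_rows, rows_with_missing_or_non_numeric_coordinates)."""
    total = bad = 0
    for row in csv.DictReader(csv_file):
        total += 1
        try:
            values = [float(row[ra]), float(row[dec])]
        except (KeyError, TypeError, ValueError):
            bad += 1  # missing column, empty cell, or non-numeric text
            continue
        if any(math.isnan(v) for v in values):
            bad += 1
    return total, bad


# Inline sample standing in for targets.csv.
sample = io.StringIO("RA,DEC\n10.5,-3.2\n,4.0\nnan,1.0\n200.1,45.0\n")
summary = count_bad_coordinates(sample)
print(summary)
```

For a file on disk, pass an open handle instead, for example `open("targets.csv", newline="")`.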
When hats-import is still the right tool
Virtual catalogs are for target lists and ad-hoc inputs — opened
fresh each session, partitioned coarsely. To publish a real,
persistently partitioned HATS catalog at survey scale (one others
register and query repeatedly), build it with the official
hats-import tool, then
point acid at the resulting tree with acid.register_catalog(name,
path=...). acid reads it like any other HATS catalog.
ValidationError: catalog '<name>' has no point_map.fits¶
Symptom: a query against a particular catalog is rejected at
compile time with a message about a missing point_map.fits.
Cause: every HATS catalog in a query needs a point_map.fits
holding per-cell row counts. ACID uses it to size work tuples to
fit your RAM budget (see Performance — RAM budget).
Every acid output and every catalog from a current hats-import
carries one; an older or hand-built catalog may not, or may carry a
0/1 footprint mask instead of real counts.
Fix, depending on where the catalog came from:
- Downloaded with acid download: it already regenerates a point_map.fits; re-download (or re-run, which resumes) if the tree is incomplete.
- Written by acid (Catalog.save / acid query --output hats): it always carries one; this error shouldn't occur, so file an issue if it does.
- Built by an older hats-import: re-import with a current version, which writes a row-count point_map.fits.
- An ad-hoc file you don't want to re-import: open it as a virtual catalog instead (acid.open(path, ra=…, dec=…)), which partitions it for you without needing a point_map.fits.
The message names the catalog and its path so you know which one to fix.
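To tell a 0/1 footprint mask from real row counts, a quick look at the per-cell values is usually enough. The check below is a heuristic sketch with a hypothetical function name; it operates on values you have already loaded (reading the FITS file is left out), and a very sparse catalog with at most one row per cell would also look like a mask.

```python
def looks_like_footprint_mask(cell_values):
    """True when every per-cell value is 0 or 1, as a coverage mask would be."""
    values = list(cell_values)
    return bool(values) and all(v in (0, 1) for v in values)


mask_like = looks_like_footprint_mask([0, 1, 1, 0, 1])
count_like = looks_like_footprint_mask([0, 1523, 88, 0, 4096])
print(mask_like, count_like)
```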
Empty / unexpected results¶
"My query returns zero rows but I expected matches."¶
Symptom: a query that should produce data returns empty.
Cause:
- A debug cone with no coverage. If you execute a query inside a with acid.in_cone(...): block, partitions outside the cone are pruned, so a cone over an empty patch returns nothing.
- A WHERE that's tighter than you think. IN_MOC(a, 'name') restricts to a MOC; an unfamiliar MOC name (or a misspelled one) that doesn't auto-resolve as a catalog footprint causes a ValidationError, but a valid MOC over the wrong patch silently returns nothing.
- A column-subset download that dropped what you're filtering on. acid download --columns ... skips columns you didn't ask for; a later WHERE on a missing column raises, but a WHERE that always evaluates to NULL on the columns you did keep silently returns nothing.
Fix: run the same query inside a wider cone (or no cone) with a
limit(10) first to verify it produces anything; then narrow back
down. The debug-small-run-big guide
covers the pattern.
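The "WHERE that always evaluates to NULL" case is standard SQL three-valued logic: a comparison against NULL is neither true nor false, so the row is filtered out either way. The sketch below uses SQLite from the Python standard library as a stand-in; the table and column names are made up, and it assumes ACID's SQL follows the same NULL semantics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER, mag REAL)")
# Every mag is NULL, as if the column were present but never populated.
conn.executemany("INSERT INTO src VALUES (?, ?)", [(1, None), (2, None), (3, None)])

total = conn.execute("SELECT COUNT(*) FROM src").fetchone()[0]
kept = conn.execute("SELECT COUNT(*) FROM src WHERE mag < 20").fetchone()[0]
excluded = conn.execute("SELECT COUNT(*) FROM src WHERE NOT (mag < 20)").fetchone()[0]
nulls = conn.execute("SELECT COUNT(*) FROM src WHERE mag IS NULL").fetchone()[0]

# Neither the predicate nor its negation keeps any row; only IS NULL finds them.
print(total, kept, excluded, nulls)
conn.close()
```

Counting rows with IS NULL on the filtered column is a cheap way to confirm this cause before widening the cone.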
Performance & memory¶
"My job OOM'd in the reduce step."¶
Symptom: the query runs through phase 1, then dies with
ExecutionError and the process RSS spiked at the end.
Cause: phase-2 reduce loaded too much into memory at once. Phase 1
spills past inmem_row_limit automatically (default 50 M rows), so
the partials are usually on disk by reduce time — but a r.to_pandas()
or r.to_polars() then loads the spill into a single in-memory
DataFrame.
Fix: in order of preference:
1. Stream with r.batches() instead of r.to_pandas(). The spill is disk-backed; streaming reads it lazily. See Results — Streaming.
2. Lower inmem_row_limit so phase 1 spills earlier and the reduce stays disk-backed end-to-end. See Performance — Memory & spill.
3. Then reduce workers to lower concurrent RSS, if step 2 isn't enough.
See Errors — ExecutionError for the full remediation order.
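The streaming approach works because most summaries can be accumulated one batch at a time. The sketch below shows the pattern with a stand-in generator; fake_batches and streaming_mean are hypothetical names, and what r.batches() actually yields is not specified on this page, so plain lists are assumed here.

```python
def fake_batches(n_rows, batch_size):
    """Stand-in for a batch iterator: yields lists of magnitudes lazily."""
    for start in range(0, n_rows, batch_size):
        stop = min(start + batch_size, n_rows)
        yield [15.0 + (i % 10) * 0.5 for i in range(start, stop)]


def streaming_mean(batches):
    """Mean over all rows while holding only one batch in memory at a time."""
    count = 0
    total = 0.0
    for batch in batches:
        count += len(batch)
        total += sum(batch)
    return total / count if count else float("nan")


mean_mag = streaming_mean(fake_batches(n_rows=1_000, batch_size=128))
print(mean_mag)
```

Peak memory is bounded by the batch size rather than the result size, which is the property the first fix relies on.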
"My job runs but is much slower than I expect on a many-core node."¶
Symptom: wall time doesn't scale with workers.
Cause: usually one of:
- You're inside a cgroup with a CPU quota smaller than the host's core count. os.cpu_count() reads the host (e.g. 128), but acid uses cpu_cap(), which reads the actual quota (e.g. 4). workers="auto" honors the quota; explicit workers=64 does not and oversubscribes.
- Process-pool churn. Spinning up a new Connection per query costs ~2.5 s of import time. Open one connection and reuse it.
- Allocator contention. The default _RJEM_MALLOC_CONF=dirty_decay_ms:-1,muzzy_decay_ms:-1 removes madvise contention at the cost of ~+20 % RSS; if it's been overridden (e.g. to recover RAM), wall time goes up.
Fix: profile with ACID_PROFILE=1 acid query "..."; reuse one
Connection; verify cpu_cap() matches the cgroup quota. See
Performance & parallelism for the full
walkthrough.
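To check a quota by hand, the cgroup v2 cpu.max file holds either "max <period>" (no limit) or "<quota> <period>" in microseconds. The sketch below parses that format; it is not ACID's cpu_cap() implementation, the function names are hypothetical, and the file is commonly found at /sys/fs/cgroup/cpu.max inside a container, though the path can differ on your system.

```python
import math
import os


def parse_cpu_max(text):
    """Parse cgroup v2 `cpu.max` content ("<quota> <period>" or "max <period>").

    Returns the CPU limit as a float number of cores, or None when unlimited.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def effective_cores(cpu_max_text, host_cores=None):
    """Cores a process can actually use: the quota if set, else the host count."""
    host = host_cores if host_cores is not None else (os.cpu_count() or 1)
    limit = parse_cpu_max(cpu_max_text)
    if limit is None:
        return host
    return max(1, min(host, math.ceil(limit)))


limited = effective_cores("400000 100000", host_cores=128)  # quota of 4 CPUs
unlimited = effective_cores("max 100000", host_cores=128)   # no quota
print(limited, unlimited)
```

If the first number is far below the value you pass as workers, oversubscription is the likely cause of the slowdown.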
"Pool startup takes ages before the first query runs."¶
Symptom: acid sits idle for several seconds before any work
appears in the progress bar.
Cause: worker-pool startup. The forkserver preload is on by
default and pre-imports numpy/polars/pyarrow/scipy/cdshealpix; that
takes ~2.5 s once. With ACID_PREWARM on (default), all workers come
up together behind a barrier before the first query.
Fix: the real answer for most users is reuse a single Connection
across queries — startup is paid once and amortises across everything
that follows. Only if you're running many short one-off queries where
the ~2.5 s bootstrap dominates the actual work should you reach for
ACID_FORKSERVER_PRELOAD=0 (which trades startup latency for
per-worker import cost). See
Performance — Worker-startup knobs.
Downloads¶
"acid download failed partway through. Is the directory usable?"¶
Symptom: the download exited non-zero. There's a partial tree on disk.
Cause / fix: acid download is designed to either complete or
fail loudly — a half-downloaded HATS tree looks structurally valid
but silently misses data, so a partial exit-zero would be the worst
possible failure mode. The behavior:
- The first worker exception aborts the download; queued transfers are cancelled and the CLI exits non-zero.
- Re-running the same command resumes: files already on disk are skipped. Retry is cheap.
Just re-run the command. If the underlying network is unreliable,
raise --timeout and re-run; if SSH retries exhaust, fix the link or
switch to HTTP. See Downloading catalogs.
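The fail-loudly-then-resume behavior can be pictured with a small general-purpose pattern: skip files that already exist, and write each new file under a temporary name until it is complete. This is a sketch of that pattern, not acid download's implementation; fetch_missing, flaky_fetch_factory, and the ".part" suffix are all hypothetical, and whether acid uses temporary names is not stated here.

```python
import os
import tempfile


def fetch_missing(names, dest_dir, fetch):
    """Download each file unless it already exists; return names fetched.

    Each file is written to a temporary name and renamed only when complete,
    so a file present under its final name is always whole.
    """
    fetched = []
    for name in names:
        final_path = os.path.join(dest_dir, name)
        if os.path.exists(final_path):
            continue  # finished in an earlier run
        tmp_path = final_path + ".part"
        with open(tmp_path, "wb") as out:
            out.write(fetch(name))  # an exception here aborts the whole run
        os.replace(tmp_path, final_path)
        fetched.append(name)
    return fetched


def flaky_fetch_factory(fail_on):
    """Hypothetical transport that fails once on a chosen file name."""
    state = {"failed": False}

    def fetch(name):
        if name == fail_on and not state["failed"]:
            state["failed"] = True
            raise ConnectionError(f"transfer of {name} interrupted")
        return name.encode()

    return fetch


with tempfile.TemporaryDirectory() as dest:
    files = ["a.parquet", "b.parquet", "c.parquet"]
    fetch = flaky_fetch_factory(fail_on="b.parquet")
    try:
        fetch_missing(files, dest, fetch)
    except ConnectionError as exc:
        print("first run failed:", exc)
    second_run = fetch_missing(files, dest, fetch)
    print("second run fetched:", second_run)
```

The second run skips the file that completed and fetches only what is missing, which is why retrying is cheap.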
"acid download --estimate doesn't print a precise size."¶
Symptom: the estimate is an order of magnitude off, or shows "approximate".
Cause: without --prefetch-metadata, the estimate is derived
from partition_info.csv and per-partition averages; it doesn't read
the catalog's _metadata for exact byte ranges.
Fix: pass --prefetch-metadata to fetch _metadata once and get
precise figures. Note _metadata can be hundreds of MB on wide
catalogs (a Rubin object catalog is ~950 MB), so the prefetch itself
isn't free.
"I downloaded the catalog but the analyzer says it has no margin cache."¶
Symptom: ValidationError: ... has no neighbor_path (margin cache)
configured, after a fresh acid download.
Cause: either the source catalog had no margin cache (some
catalogs at data.lsdb.io are published without one), or you passed
--skip-margin during the download.
Fix: build a local margin cache. See Margin caches.
Finding & resolving catalogs¶
"A catalog name won't resolve — catalog '<name>' not found."¶
Symptom: acid query / acid.open(...), acid download, or
acid inspect rejects a bare catalog name. The error carries a per-root
trail and a hint:
error[registry]:
> catalog 'gaia_dr_3' not found; searched roots […] and registered names
= help: did you mean 'gaia_dr3'?
= note: run `acid download gaia_dr_3` to fetch it, or `acid search` to find it
or pass an absolute path / URL, register via add_catalog(...), or use a YAML config
Cause: the name didn't match any registered catalog, any catalog on
the searched roots, or — for acid download — any catalog on the
download path. Most often it's a typo, a catalog that lives under a
different root than the one you're searching, or a catalog you haven't
downloaded yet.
Fix: the error tells you which it is. Work through it in order:
1. Read the per-root trail. Each root that was searched is listed with its outcome: no match, unreachable: <reason>, or malformed collection. This tells you where acid looked and what it found at each place.
2. Take the did you mean '<closest>'? hint seriously. It's drawn from catalogs acid already knows about (registered names, catalogs on your local roots, and cached download listings), so a suggestion is a real, reachable catalog, usually your typo corrected (gaia_dr_3 → gaia_dr3, two_ass → two_mass).
3. A root reported unreachable is not the same as "catalog absent". If the trail shows an unreachable: <reason> line and the note says "a root was unreachable — the catalog may exist there but couldn't be checked," a transport failure (an SSH host that's down, a connection refused / timed out, a 5xx from a mirror) stopped acid from checking that root. The catalog may well be there. Fix or re-try the connection (check the VPN / SSH host, raise --timeout, try again) rather than assuming the catalog is gone. ACID no longer treats a probe that errored as "not found"; it surfaces the failure so you don't chase a phantom typo when the real problem is the network.
4. Discover and fetch. For a name you haven't downloaded, the next step depends on the command:
   - acid query / acid.open point you at acid download <name> to fetch the catalog first. If the closest match is itself available to download, the error says so directly: "<name> is available to download — run acid download <name>."
   - acid download / acid inspect point you at acid search <name> to find a downloadable catalog, or at passing an explicit URL / path.
Running acid search once makes future suggestions smarter
The did you mean hint never triggers a fresh network crawl — it
reads only what's already cached locally (the ~1-hour acid search
cache) plus a cheap walk of your local roots. So the first time you
look for a catalog on a remote mirror, the suggestion list may be empty
or thin. Run acid search
once to populate the cache, and subsequent not-found errors can suggest
the right remote name.
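A suggestion of this kind can only be as good as the list of names it is compared against. The sketch below illustrates the idea with difflib from the Python standard library; it is not how ACID computes its hint, and the function name, the cutoff, and the list of known names are assumptions for the example.

```python
import difflib


def did_you_mean(name, known_names, cutoff=0.6):
    """Return the closest known name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(name, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical set of names already known locally.
known = ["gaia_dr3", "two_mass", "sdss_dr17"]
first = did_you_mean("gaia_dr_3", known)
second = did_you_mean("two_ass", known)
nothing = did_you_mean("gaia_dr_3", [])  # empty cache: no suggestion possible
print(first, second, nothing)
```

With an empty list there is nothing to suggest, which mirrors why populating the search cache once improves later hints.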
Notebook / Connection lifecycle¶
ConnectionClosedError even though I never called close()¶
Symptom: a Catalog handle raises ConnectionClosedError partway
through a notebook.
Cause: the underlying Connection was torn down — usually because
you called acid.shutdown() (or built an explicit acid.Connection(...)
as a context manager and let its with block exit) while a Catalog
handle from it was still in scope.
Fix: for the common case, just keep using the module-level API
(acid.open(...), acid.sql.query(...)) — it lazy-reopens a default
Connection on the next call, so re-running the cell that built the
Catalog revives it. Only call acid.shutdown() when you actually
want to drop the pool. If you opted into an explicit acid.Connection
for isolation, keep its with block open for as long as you use its
catalogs. See Connections.
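The failure mode is easiest to see in a toy model of a handle that outlives the connection that created it. The classes below are stand-ins written for this example, not acid's Connection or Catalog, and they raise a plain RuntimeError rather than ConnectionClosedError.

```python
class ToyConnection:
    """Minimal stand-in for a connection that owns a worker pool."""

    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.closed = True  # leaving the with block tears the pool down
        return False

    def open(self, name):
        return ToyCatalog(self, name)


class ToyCatalog:
    """Handle that stays tied to the connection that created it."""

    def __init__(self, connection, name):
        self._connection = connection
        self.name = name

    def describe(self):
        if self._connection.closed:
            raise RuntimeError(f"connection behind {self.name!r} is closed")
        return f"catalog {self.name}"


with ToyConnection() as conn:
    cat = conn.open("gaia_dr3")
    print(cat.describe())  # fine: the with block is still open

try:
    cat.describe()  # the handle outlived its connection
except RuntimeError as exc:
    print("error:", exc)
```

In a notebook the with block and the later use often sit in different cells, which is why the error seems to appear without any explicit close().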
Platform¶
"Does acid run on Windows?"¶
No. acid is Linux/macOS only — it uses POSIX-specific
primitives (os.fork, fcntl, the forkserver start method, cgroup
files) that have no equivalent on Windows. Use WSL2.
"My LSDB / hats.read_hats(...) import won't open an acid-written catalog."¶
Symptom: an output written via Catalog.save(...) or acid query
--output looks like a HATS tree but external tools complain.
Cause: rare. HATS spec compliance is part of ACID's
test suite (tests/test_gaia_hats.py::test_write_produces_valid_hats
covers the round-trip), but bugs happen, especially on edge cases
(empty partitions, unusual schema types).
Fix: check that the tree includes properties,
partition_info.csv, and a populated dataset/ directory. If those
all look right, open an issue on
GitHub with the directory
listing and the consuming tool's error.
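The three checks above are easy to script before filing an issue. The helper below is a minimal sketch with a hypothetical name; it assumes properties and partition_info.csv sit at the top of the tree and treats dataset/ as populated when it contains at least one file at any depth.

```python
import tempfile
from pathlib import Path


def missing_hats_pieces(root):
    """List which of the three expected pieces are absent under `root`."""
    root = Path(root)
    problems = []
    if not (root / "properties").is_file():
        problems.append("properties")
    if not (root / "partition_info.csv").is_file():
        problems.append("partition_info.csv")
    dataset = root / "dataset"
    if not dataset.is_dir() or not any(p.is_file() for p in dataset.rglob("*")):
        problems.append("dataset/ (missing or empty)")
    return problems


# Build a deliberately incomplete tree to show the report.
with tempfile.TemporaryDirectory() as tmp:
    tree = Path(tmp)
    (tree / "properties").write_text("placeholder\n")
    (tree / "dataset").mkdir()
    report = missing_hats_pieces(tree)
    print(report)
```

An empty list means the basic layout is present, and the directory listing plus the consuming tool's error are the next things to attach to the issue.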