Margin caches
A margin cache is a small sibling catalog that holds copies of the rows sitting within a few arcseconds of each HEALPix partition's outer boundary. Without one, a crossmatch silently misses pairs that straddle a boundary: each partition is matched against only its own pixel's rows, so a Gaia source on one side of a pixel edge and its 2MASS counterpart on the other never appear in the same partition, and the match is dropped without a warning.
This page covers:
- What a margin cache is, and why every right-side catalog in a crossmatch needs one.
- The radius-vs-margin rule: ACID rejects a crossmatch whose radius exceeds the cache's recorded width, at analyze time, with the fix in the error hint.
- The full acid hats build-margin CLI surface: flags, defaults, and when each one matters.
- Sizing --margin-arcsec for your science.
- When to rebuild a cache vs. when the one shipped with your catalog download is already enough.
- The spill mechanics: what --mem-limit controls and what to lower when the build runs out of memory.
What a margin cache is
A HATS catalog is partitioned by HEALPix pixel. Each partition file is
the set of rows whose _healpix_29 value falls inside one pixel at the
catalog's Norder — and only those rows. That partitioning is what
makes large catalogs queryable; it is also what makes naive
crossmatching wrong at partition boundaries.
A margin cache is a second, parallel HATS tree built alongside the main
catalog. For each partition pixel P, the corresponding margin file
holds every row in the whole catalog whose position is within
margin_arcsec of pixel P's border but in a different pixel. When
ACID crossmatches the right catalog against an anchor, the matcher
reads the right partition's data and its margin file, so
boundary-crossing pairs are found correctly.
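The boundary problem can be reproduced with a toy one-dimensional analogue. The sketch below is not ACID code: the bins stand in for HEALPix pixels, `crossmatch` is a hypothetical helper, and the `margin` argument plays the role of the margin file by letting each partition also read right-side rows just outside its own bin.

```python
def crossmatch(left, right, edges, radius, margin=0.0):
    """Match partition by partition on a 1-D line.

    Each partition sees only the left rows inside its own bin. The right
    side additionally sees rows within `margin` of the bin (the toy
    equivalent of reading the partition's margin file).
    """
    pairs = set()
    for lo, hi in zip(edges[:-1], edges[1:]):
        left_rows = [x for x in left if lo <= x < hi]
        right_rows = [y for y in right if lo - margin <= y < hi + margin]
        for x in left_rows:
            for y in right_rows:
                if abs(x - y) <= radius:
                    pairs.add((x, y))
    return pairs


edges = [0.0, 10.0, 20.0]   # two partitions: [0, 10) and [10, 20)
left = [9.8, 15.0]          # 9.8 sits just below the boundary at 10
right = [10.1, 15.2]        # 10.1 sits just above it

without_margin = crossmatch(left, right, edges, radius=0.5)
with_margin = crossmatch(left, right, edges, radius=0.5, margin=0.5)

print(sorted(without_margin))  # the (9.8, 10.1) pair is missing
print(sorted(with_margin))     # both pairs are found
```

Note that the run without a margin raises no error and still returns a plausible-looking result, which is why the failure is easy to overlook.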
You will not typically open a margin cache directly. The registry picks
it up from the right-side catalog's collection.properties (a
_margin sibling or a HATS-canonical path) and threads its recorded
width (neighbor_margin_arcsec) into every analyzer check that involves
that catalog. Your job is to make sure one exists and is wide enough for
the radius you want to match at.
Anchor side does not need a margin
The catalog you call .crossmatch(...) on (the anchor) does not need
a margin cache. The matcher only needs the right side's margin —
the side passed in as the argument. A target list on the left has
no margin requirement.
The rule: XMATCH radius must not exceed neighbor_margin_arcsec
ACID rejects any crossmatch whose radius_arcsec is larger than the
right catalog's recorded neighbor_margin_arcsec, at analyze time —
before any data is read:
ValidationError: XMATCH radius_arcsec=5.0 exceeds 'twomass_psc'`s
neighbor_margin_arcsec=1.0; matches near partition boundaries would
be silently missed
hint: rebuild the margin cache at a larger radius, or shrink XMATCH
radius.
You have two fixes, both pulled straight from the hint:
- Shrink the XMATCH radius so it fits inside the existing cache. No I/O; instant. Take this path if the wider radius isn't load-bearing for your science.
- Rebuild the cache at a larger radius with
acid hats build-margin(next section), then point the registry at the new cache (or overwrite the old one in place).
If the right catalog has no margin cache at all, the analyzer fires a different rejection earlier:
ValidationError: XMATCH right table 'twomass_psc' has no neighbor_path
(margin cache) configured
hint: build one with `acid hats build-margin <catalog>`.
Same fix: run the builder. The crossmatch guide discusses why this is a hard rejection rather than a warning — the boundary-miss failure mode is silent, the result still looks right, and only the pairs you most needed are missing.
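The two rejections amount to a small pre-flight check. The following is a minimal sketch of that logic for use in your own scripts, not ACID's analyzer: `ValidationError` here is a local stand-in class, `check_xmatch_radius` is a hypothetical name, and the messages only paraphrase the ones shown above.

```python
class ValidationError(ValueError):
    """Local stand-in for the analyzer's error type (not imported from ACID)."""


def check_xmatch_radius(table, radius_arcsec, neighbor_margin_arcsec):
    """Raise if the right table has no margin cache or the radius exceeds it.

    `neighbor_margin_arcsec` is None when no margin cache is configured.
    """
    if neighbor_margin_arcsec is None:
        raise ValidationError(
            f"XMATCH right table {table!r} has no margin cache configured"
        )
    if radius_arcsec > neighbor_margin_arcsec:
        raise ValidationError(
            f"XMATCH radius_arcsec={radius_arcsec} exceeds {table!r}'s "
            f"neighbor_margin_arcsec={neighbor_margin_arcsec}"
        )


# A radius equal to the margin is allowed; the rule is radius <= margin.
check_xmatch_radius("twomass_psc", radius_arcsec=1.0, neighbor_margin_arcsec=1.0)

try:
    check_xmatch_radius("twomass_psc", radius_arcsec=5.0, neighbor_margin_arcsec=1.0)
except ValidationError as err:
    print(f"rejected: {err}")
```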
acid hats build-margin: building a margin cache
The CLI builds (or rebuilds) a margin cache for a local HATS catalog in place:
acid hats build-margin CATALOG
[--margin-arcsec ARCSEC]
[--workers N]
[--output DIR]
[--overwrite]
[--mem-limit GB]
| Flag | Default | Description |
|---|---|---|
| CATALOG | required | Path to a local HATS catalog directory. |
| --margin-arcsec ARCSEC | 10.0 | Margin threshold in arcseconds. Must be at least as large as the largest XMATCH radius you will ever use against this catalog (see Sizing the radius). |
| --workers N | cgroup-aware cpu_cap | Parallel worker processes. Defaults to the number of cores actually available to the process (honors cgroup quotas; see Performance & parallelism). |
| --output DIR | <catalog>_<margin>arcsec sibling | Output directory. The default writes a sibling at the same level as CATALOG. |
| --overwrite | off | Overwrite an existing margin cache at --output. Without it, a pre-existing directory is a hard error. |
| --mem-limit GB | 10 % of available RAM (min 8 GB fallback) | Memory limit at which the build's accumulator spills to disk (see Spilling mechanics). |
The output is a HATS-shaped directory that hats.read_hats(...)
opens cleanly and that ACID's registry auto-discovers.
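If you drive the CLI from a script rather than calling the Python API, the command line can be assembled from the flags in the table. This is a sketch under that assumption: `build_margin_argv` is a hypothetical helper, and it emits only the flags documented above.

```python
import shlex


def build_margin_argv(catalog, margin_arcsec=None, workers=None,
                      output=None, overwrite=False, mem_limit_gb=None):
    """Assemble an `acid hats build-margin` command from the documented flags.

    Options left as None are omitted so the CLI's own defaults apply.
    """
    argv = ["acid", "hats", "build-margin", str(catalog)]
    if margin_arcsec is not None:
        argv += ["--margin-arcsec", str(margin_arcsec)]
    if workers is not None:
        argv += ["--workers", str(workers)]
    if output is not None:
        argv += ["--output", str(output)]
    if overwrite:
        argv.append("--overwrite")
    if mem_limit_gb is not None:
        argv += ["--mem-limit", str(mem_limit_gb)]
    return argv


argv = build_margin_argv("/data/two_mass", margin_arcsec=10.0,
                         workers=4, overwrite=True)
print(shlex.join(argv))
# To execute it: subprocess.run(argv, check=True)
```

Passing the list to subprocess.run without a shell avoids quoting problems when the catalog path contains spaces.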
Example
acid hats build-margin /data/two_mass --margin-arcsec 10
This writes /data/two_mass_10arcsec/ (the default --output location).
Re-running with --overwrite is safe.
Python equivalent
The CLI is a thin wrapper around
acid.tools.build_margin.build_margin_cache;
call it directly when you want to fold the build into a longer Python
pipeline:
from acid.tools.build_margin import build_margin_cache

build_margin_cache(
    "/data/two_mass",
    margin_arcsec=10.0,
    workers=16,
    mem_limit_gb=8.0,  # default is 10 % of available RAM
)
workers=1 is the easiest mode to debug build failures in; switch to
the parallel default once the single-process build runs clean.
Sizing the radius
Pick --margin-arcsec to be at least as large as the largest XMATCH
radius you plan to run against this catalog. The exact rule the
analyzer enforces is radius_arcsec <= neighbor_margin_arcsec; any
radius larger than the recorded margin gets an analyze-time rejection
(see above).
A few practical considerations:
- Pick the maximum radius across all your foreseeable queries, then add some headroom. Rebuilding the cache is a one-time cost per catalog; running into the rejection three months later because your new study needs 2″ instead of 1″ is more annoying than going wider up front.
- For high-proper-motion catalogs, the radius needs headroom for the J2000 epoch gap. ACID treats every catalog's stored RA/Dec as J2000 / ICRS with no propagation (see crossmatch guide §1). Matching a J2016.0 Gaia catalog against a J2000 survey at a 1″ radius will silently lose Barnard's Star, which moves about 10.4″ per year and so sits roughly 166″ from its J2000 position. The recommended fix is to propagate to J2000 before registering; if you can't, widen both the XMATCH radius and the cache to absorb the largest expected offset.
- A larger margin costs proportionally more disk space and build time. The cache typically holds a single-digit percentage of the catalog at the default 10″; doubling the radius roughly doubles the margin cache size.
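The sizing considerations in the list above reduce to simple arithmetic. The sketch below rests on two assumptions that do not come from this page: a total proper motion of about 10.39″/yr for Barnard's Star (a commonly quoted value), and a 1.5× headroom factor, which is only an illustrative choice. Both helper names are hypothetical.

```python
def epoch_offset_arcsec(pm_arcsec_per_yr, epoch_from, epoch_to):
    """Linear on-sky displacement accumulated between two epochs."""
    return abs(epoch_to - epoch_from) * pm_arcsec_per_yr


def margin_for(radii_arcsec, headroom=1.5):
    """Largest planned XMATCH radius, scaled by an illustrative headroom factor."""
    return max(radii_arcsec) * headroom


# Barnard's Star between J2000 and J2016.0, assuming ~10.39 arcsec/yr.
barnard_offset = epoch_offset_arcsec(10.39, 2000.0, 2016.0)
print(f"Barnard's Star epoch offset: {barnard_offset:.0f} arcsec")

# Planned radii of 1 and 2 arcsec with 1.5x headroom.
print(f"suggested --margin-arcsec: {margin_for([1.0, 2.0])}")
```

An offset of this size is far beyond any sensible margin, which is why propagating positions to a common epoch before registering is the preferred fix for the fastest-moving stars.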
For most extragalactic / static-sky use, 10.0 (the default) is more
than enough. The default exists because typical crossmatches sit
comfortably inside it; widen only when you have a concrete reason.
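To get a rough feel for what a given margin costs, the cache can be estimated as a thin band around each pixel. This is a back-of-the-envelope sketch, not a measurement: it treats each HEALPix pixel as a flat square of equal area, assumes uniform source density, and ignores overlap between neighboring bands. The Norder of 8 is only an example value.

```python
import math

ARCSEC_PER_RADIAN = math.degrees(1.0) * 3600.0


def pixel_side_arcsec(norder):
    """Side of a square with the same area as one HEALPix pixel at `norder`."""
    npix = 12 * 4 ** norder                 # pixels on the full sphere
    area_sr = 4.0 * math.pi / npix          # equal-area pixels, in steradians
    return math.sqrt(area_sr) * ARCSEC_PER_RADIAN


def margin_fraction(norder, margin_arcsec):
    """Band area around a pixel divided by the pixel's own area."""
    side = pixel_side_arcsec(norder)
    return ((side + 2.0 * margin_arcsec) ** 2 - side ** 2) / side ** 2


for margin in (10.0, 20.0):
    print(f"Norder 8, margin {margin:>4.0f} arcsec: "
          f"{margin_fraction(8, margin):.1%} of the catalog")
```

Under these assumptions the fraction grows almost linearly with the margin, and it rises quickly at finer Norder because the pixels shrink while the band keeps its width.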
Rebuilding vs. ingesting one
Catalogs you obtain in three different ways need different handling:
- acid download <hats-url>: the download includes the catalog's published margin cache automatically. Pass --skip-margin to exclude it (useful if disk is tight and you do not plan to crossmatch); otherwise you are done and acid hats build-margin is unnecessary unless you need a wider radius than the published cache supports.
- hats-import (you imported your own data): the upstream hats-import tool builds a main catalog but does not include a margin cache by default. Run acid hats build-margin once over the imported catalog before you crossmatch it. The two builders write the same on-disk layout; the cache works regardless of which one produced it.
- An existing HATS catalog whose published margin is too narrow. Rebuild over the original catalog with a larger --margin-arcsec, then either move the new cache into place as the sibling or update the catalog's collection.properties to point at it. The registry picks up whichever <name>_margin* / canonical-path sibling it finds.
You may also need to rebuild when you join two catalogs at a finer
HEALPix Norder than the cache supports — the refinement reads the
cache at the finer pixel scale, and a coarsely-built cache cannot
satisfy boundary queries at a finer order.
Spilling mechanics
The builder is a two-phase pipeline:
- Phase 0 computes the boundary tables (which output partitions each input partition can contribute margin rows to). This is cheap.
- Phase 1 scans the catalog's partition files in parallel. As rows
near a partition border are discovered, they accumulate in memory,
keyed by their destination output partition. A ParquetWriter is
opened per destination on demand.
When the in-memory accumulator passes --mem-limit (default 10 % of
available RAM), the builder spills the largest accumulating partitions
to disk and continues. Each spill appends row groups to the per-output
parquet file — no full rewrite. After phase 1, the disk-spilled groups
and the remaining in-memory partials are stitched together into the
final per-pixel margin files.
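The spill policy described above can be illustrated with a small in-memory model. This is a toy sketch, not the builder's implementation: it counts rows instead of bytes, uses a dictionary of lists as a stand-in for the per-destination parquet files, and the class name `SpillingAccumulator` is hypothetical.

```python
from collections import defaultdict


class SpillingAccumulator:
    """Buffer rows per destination pixel; spill the largest buffer when over budget."""

    def __init__(self, max_rows_in_memory):
        self.max_rows = max_rows_in_memory
        self.in_memory = defaultdict(list)  # destination -> buffered rows
        self.spilled = defaultdict(list)    # destination -> list of "row groups"
        self.rows_in_memory = 0
        self.spill_count = 0

    def add(self, dest, row):
        self.in_memory[dest].append(row)
        self.rows_in_memory += 1
        while self.rows_in_memory > self.max_rows:
            self._spill_largest()

    def _spill_largest(self):
        # Evict the destination currently holding the most buffered rows.
        dest = max(self.in_memory, key=lambda d: len(self.in_memory[d]))
        rows = self.in_memory.pop(dest)
        self.spilled[dest].append(rows)     # append a row group, no rewrite
        self.rows_in_memory -= len(rows)
        self.spill_count += 1

    def finalize(self):
        """Stitch spilled row groups and remaining in-memory rows per destination."""
        result = {}
        for dest in set(self.spilled) | set(self.in_memory):
            rows = [r for group in self.spilled.get(dest, []) for r in group]
            rows += self.in_memory.get(dest, [])
            result[dest] = rows
        return result


acc = SpillingAccumulator(max_rows_in_memory=4)
for i in range(10):
    acc.add(dest=i % 3, row=i)

result = acc.finalize()
print(f"spills: {acc.spill_count}, rows still in memory: {acc.rows_in_memory}")
print(dict(sorted(result.items())))
```

Lowering `max_rows_in_memory` in this model triggers more and smaller spills, which mirrors the trade-off behind a lower --mem-limit: more write operations in exchange for a smaller peak footprint.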
The practical knob to remember:
If build-margin runs out of memory, lower --mem-limit
Phase 1 holds accumulating margin rows in RAM until they spill. A
common failure mode on memory-tight nodes is the default spill
threshold being too high relative to what is actually free. Pass
--mem-limit 4 (or smaller) to force earlier spills; the build
runs a little slower but stays well under the limit. The default
is 10 % of RAM, which is conservative but assumes the build has
that share to itself.
--workers is the other coarse lever: more workers finish phase 1
faster but each worker holds its own slice of the accumulator. On
cgroup-quota-restricted nodes the default already respects the actual
core ceiling
(see Performance & parallelism),
so usually you only need to override it to lower it (e.g. to leave
headroom for a parallel acid query job on the same box).
See also
- Crossmatching catalogs: the user-facing rules the margin cache underwrites, including the radius-vs-margin error and its fix.
- Performance & parallelism: the cgroup-aware worker / thread story, allocator tuning, and the spill knobs at query time (inmem_row_limit) that are sometimes confused with this page's --mem-limit.
- Downloading catalogs: how acid download ships the catalog's margin cache by default, and what --skip-margin changes.
- CLI reference: the canonical, kept-up-to-date list of every acid hats build-margin flag.
- Troubleshooting: symptom entries for "OOM during build-margin", "build-margin slow on a many-core node", "I built a cache but the analyzer still rejects my radius".