Downloading catalogs¶
acid download fetches a HATS catalog (or a slice of one) from a
remote source to local disk, in a HATS-valid layout you can hand
straight to acid.init("./that/dir/"). It supports HTTP, SSH, and
local-filesystem sources, with spatial- and column-subset modes for
trimming the download to what you actually need.
This page covers the common cases:
- Finding which catalogs are available before you download.
- Fetching a full catalog by name, or by explicit URL.
- Subsetting by cone (only download partitions overlapping a sky region) and by columns.
- The hard guarantee that downloads either complete or fail loudly — no half-downloaded catalogs that look structurally valid.
- Inspecting a remote catalog before committing to the download.
The full flag set with defaults is in the CLI reference; this page is the task-shaped overview.
Finding catalogs¶
Before downloading, list what's available with acid search:
acid search
gaia_dr3 margins: 10, 300 arcsec
two_mass margins: 5 arcsec
wise/allwise margins: none
ztf_dr22 margins: 10 arcsec
Each line is one catalog you can hand straight to acid download. The
name on the left is the download token; the margins: column lists the
margin-cache radii (in arcseconds, ICRS) available for that catalog.
Check the margins before you download
A margin cache is what makes boundary-crossing crossmatches correct,
and you can only crossmatch against a catalog whose margin cache is
at least as wide as your match radius (see the
margin caches guide). The margins: column lets
you confirm a wide-enough margin exists before committing to the
download — margins: none means you'll have to build one yourself
with acid hats build-margin afterward.
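As a rough illustration of that rule, the check can be sketched in Python. The dictionary below is transcribed by hand from the sample listing above, and has_wide_enough_margin is a hypothetical helper, not part of acid.

```python
# Margin radii in arcseconds, copied by hand from the sample listing above.
# An empty list stands for "margins: none".
MARGINS_ARCSEC = {
    "gaia_dr3": [10.0, 300.0],
    "two_mass": [5.0],
    "wise/allwise": [],
    "ztf_dr22": [10.0],
}


def has_wide_enough_margin(radii, match_radius_arcsec):
    """True if at least one margin is at least as wide as the match radius."""
    return any(radius >= match_radius_arcsec for radius in radii)


for name, radii in MARGINS_ARCSEC.items():
    status = "ok" if has_wide_enough_margin(radii, 10.0) else "too narrow or missing"
    print(f"{name}: {status}")
```

For a 10 arcsec match radius, only the catalogs with a margin of 10 arcsec or wider pass; the others would need a margin built locally first.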
acid search crawls every root on the download path over its native
transport — local directories, ssh:// hosts, and http(s):// mirrors
alike. Pass a substring to filter (case-insensitive):
acid search gaia # only catalogs whose name contains "gaia"
acid search wise # surfaces wise/allwise, wise/catwise, ...
Catalogs grouped under a namespace directory on the mirror surface as
namespace/child (e.g. wise/allwise). That two-part name is exactly
the token acid download expects — see
Downloading by name below.
HATS collections are just catalogs here
A catalog published as a HATS collection (a directory bundling the
primary table with its margin caches) shows up in acid search as a
plain catalog name — the "collection" structure is read internally to
find the margins but never surfaced. You download it by its name like
any other.
Listings are cached¶
Remote (http / ssh) listings are cached under
$XDG_CACHE_HOME/acid/downloads for about an hour, so a repeated
acid search over a slow mirror returns instantly. Local roots are
always crawled live. To force a fresh crawl:
acid search --cache refresh # re-crawl, rewriting the cached listing
acid search --cache off # bypass the cache entirely (read and write)
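The three cache behaviors described above can be summarized as a small decision function. This is only a sketch of the documented behavior, not acid's implementation: cache_plan is a hypothetical name, "default" is a placeholder for running without --cache, and the one-hour lifetime is the approximate figure quoted above.

```python
def cache_plan(mode, age_seconds, ttl_seconds=3600):
    """Return (read_from_cache, write_to_cache) for one remote listing.

    mode: "default", "refresh", or "off". "default" is a placeholder for
          running without --cache; the real default name is an assumption.
    age_seconds: age of the cached listing, or None if nothing is cached.
    """
    if mode == "off":
        return (False, False)  # bypass the cache for both reads and writes
    if mode == "refresh":
        return (False, True)   # always re-crawl, then rewrite the cache
    fresh = age_seconds is not None and age_seconds < ttl_seconds
    return (fresh, not fresh)  # reuse a fresh listing, otherwise crawl and store


print(cache_plan("default", 120))
print(cache_plan("refresh", 120))
print(cache_plan("off", 120))
```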
Scripting against acid search¶
On a terminal, acid search prints an aligned, colored table with a
live spinner while it crawls. When piped, it switches to clean
tab-separated output — name⇥margins⇥root⇥shadowed-marker — so it
composes with the usual shell tools:
# Names only, of catalogs that acid download would actually fetch
# (the 4th column, the shadowed marker, is empty).
acid search | awk -F'\t' '$4 == "" { print $1 }'
The fourth column and the shadowed marker matter when your download
path has more than one root — see Shadowing.
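The same filter as the awk one-liner can be written in Python when the listing feeds a larger script. The sample rows below are invented for illustration, and the exact formatting of the margins column in piped output is an assumption; only the four-column, tab-separated layout comes from the description above.

```python
def fetchable_names(tsv_text):
    """Names of catalogs whose 4th (shadowed-marker) column is empty."""
    names = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        fields += [""] * (4 - len(fields))  # tolerate a missing trailing column
        if fields[3] == "":
            names.append(fields[0])
    return names


# Invented sample in the name / margins / root / shadowed-marker layout.
# In a real script the text would come from the CLI instead, e.g.
# subprocess.run(["acid", "search"], capture_output=True, text=True).stdout
sample = (
    "gaia_dr3\t10, 300\thttps://data.lsdb.io/hats/\t\n"
    "gaia_dr3\t10\t/mnt/mirror/hats\tshadowed\n"
    "two_mass\t5\thttps://data.lsdb.io/hats/\t\n"
)
print(fetchable_names(sample))
```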
The shortest download¶
Downloading by name¶
The simplest download is by name. Whatever acid search listed, you
can fetch:
acid download gaia_dr3
That resolves gaia_dr3 against the download path, fetches the full
catalog (including its margin cache), and lands it under your local
catalog path as <ACID_PATH>/gaia_dr3, re-openable by that bare name.
A nested name lands under its leaf segment — the namespace is a
remote locator, not part of the local identity:
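The naming rule can be sketched in a few lines of Python. local_destination is a hypothetical helper written for this page, and /data/hats stands in for whatever your ACID_PATH root is.

```python
from pathlib import PurePosixPath


def local_destination(acid_path, download_name):
    """Directory a by-name download lands in: <acid_path>/<leaf segment>."""
    leaf = PurePosixPath(download_name).name  # drops any namespace prefix
    return PurePosixPath(acid_path) / leaf


print(local_destination("/data/hats", "gaia_dr3"))      # /data/hats/gaia_dr3
print(local_destination("/data/hats", "wise/allwise"))  # /data/hats/allwise
```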
You can name the destination explicitly as a second argument when you want it somewhere specific:
Downloading by explicit URL or path¶
You can also point acid download straight at a source, bypassing
name resolution. An explicit source is anything that is a URL or a
filesystem path (a leading ./, /, or ~):
acid download accepts three explicit source kinds:
| Source | Example |
|---|---|
| HTTP / HTTPS | https://data.lsdb.io/hats/two_mass/two_mass |
| SSH | user@server:/sdf/data/hats/gaia_dr3 |
| Local path | /mnt/mirror/hats/two_mass |
An explicit source requires an explicit destination — there's no catalog name to derive one from.
To copy from a local directory, prefix it with ./ or /
A bare token like gaia_dr3 is always treated as a name to look
up on the download path, never as a local directory. To copy from a
relative local directory, give it a leading ./ (./gaia_dr3) or an
absolute path (/mnt/mirror/gaia_dr3) so acid download uses it
verbatim instead of searching the mirror for a catalog called
gaia_dr3.
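The distinction between a name and an explicit source can be approximated as follows. This is a sketch of the rule as described on this page, not acid's actual parser: is_explicit_source and the scp-style pattern are assumptions made for illustration.

```python
import re

# user@host:/path or host:/path, the form used in the SSH example above.
_SCP_STYLE = re.compile(r"^(?:[^@/\s]+@)?[^:/\s]+:/")


def is_explicit_source(token):
    """Guess whether a token is an explicit source rather than a catalog name."""
    if "://" in token:                       # http://, https://, ssh://
        return True
    if token.startswith(("./", "/", "~")):   # local filesystem path
        return True
    return bool(_SCP_STYLE.match(token))     # scp-style SSH source


for token in ["gaia_dr3", "wise/allwise", "./gaia_dr3",
              "https://data.lsdb.io/hats/two_mass/two_mass",
              "user@server:/sdf/data/hats/gaia_dr3"]:
    kind = "explicit source" if is_explicit_source(token) else "name"
    print(f"{token}: {kind}")
```

Note that a namespaced name such as wise/allwise contains a slash but is still a name, because it has no leading ./, /, or ~.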
SSH transfers run through the system ssh binary as a subprocess, so
~/.ssh/config (including ProxyJump and aliased hosts) is honored.
paramiko and fsspec do not honor ~/.ssh/config, which is why
acid download deliberately shells out to ssh instead.
The download path¶
When you download by name, the name is resolved against the
download path — a first-wins list of roots, configurable via
ACID_DOWNLOAD_PATH, an acid.conf download_path = line, or the
built-in default. Out of the box that default is two roots, searched
in order:
- https://data.lsdb.io/hats/ (the public LSDB HATS mirror)
- ssh://slacd/sdf/home/m/mjuric/datasets (the SLAC datasets dir)
The first root that has a catalog of the requested name wins.
When a name resolves against no root, the error is a per-root
diagnostic rather than a bare "not found": each root on the download path
is listed with its outcome — no match, unreachable: <reason>, or
malformed collection — plus a did you mean '<closest>'? suggestion
drawn from cached / local listings, and a pointer to acid search. A
root that errored (an SSH host that's down, a connection refused, a
5xx from a mirror) is called out as unreachable rather than silently
treated as missing — the catalog may exist there but couldn't be checked,
so fix or re-try the connection before assuming it's gone. See
Troubleshooting — Finding & resolving catalogs
for the full walkthrough.
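To make the shape of that diagnostic concrete, here is a minimal sketch that assembles a similar message. The wording, the not_found_message helper, and the use of difflib for the suggestion are all assumptions; the page does not say how acid computes the closest match.

```python
import difflib


def not_found_message(name, outcomes, known_names):
    """Build a per-root diagnostic for a name that resolved nowhere.

    outcomes: ordered (root, outcome) pairs, e.g. "no match" or
              "unreachable: connection refused".
    known_names: catalog names from cached or local listings.
    """
    lines = [f"catalog {name!r} not found on the download path:"]
    for root, outcome in outcomes:
        lines.append(f"  {root}: {outcome}")
    closest = difflib.get_close_matches(name, known_names, n=1)
    if closest:
        lines.append(f"did you mean '{closest[0]}'?")
    lines.append("run `acid search` to list available catalogs")
    return "\n".join(lines)


print(not_found_message(
    "gaia_dr2",
    [("https://data.lsdb.io/hats/", "no match"),
     ("ssh://slacd/sdf/home/m/mjuric/datasets", "unreachable: connection refused")],
    ["gaia_dr3", "two_mass", "ztf_dr22"],
))
```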
When roots shadow each other¶
When more than one root holds a catalog of the same name, acid search
shows every occurrence but flags the later ones — acid download
resolves first-wins, so a later copy is shadowed: it would never be the
one fetched by name. On a terminal the shadowed row carries a trailing
*; when piped, the fourth TSV column reads shadowed. The summary line
reports the count, e.g. ✓ 12 catalogs (2 shadowed), with a footnote
explaining the marker. To actually fetch a shadowed copy, download it by
its explicit URL or path instead of by name.
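First-wins resolution and the shadowed flag follow from the same ordered walk over the roots. The sketch below models each root as a set of catalog names; the root contents are invented for illustration, and resolve and mark_shadowed are hypothetical helpers rather than acid functions.

```python
def resolve(name, roots):
    """Return the first root holding `name`, or None. roots is an ordered list."""
    for root, names in roots:
        if name in names:
            return root
    return None


def mark_shadowed(roots):
    """List (name, root, shadowed) for every occurrence, in download-path order."""
    seen = set()
    rows = []
    for root, names in roots:
        for name in sorted(names):
            rows.append((name, root, name in seen))  # later copies are shadowed
        seen.update(names)
    return rows


# Invented contents for the two default roots.
roots = [
    ("https://data.lsdb.io/hats/", {"gaia_dr3", "two_mass"}),
    ("ssh://slacd/sdf/home/m/mjuric/datasets", {"gaia_dr3", "ztf_dr22"}),
]
print(resolve("gaia_dr3", roots))
for row in mark_shadowed(roots):
    print(row)
```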
After downloading¶
Once a download finishes, the destination is itself a valid HATS catalog
directory — point acid at it and you can query immediately:
import acid
acid.init("/data/hats") # the ACID_PATH root the download landed under
print([c.name for c in acid.list_catalogs()])
cat = acid.open("gaia_dr3")
print(cat.head(10).to_astropy())
Subset downloads¶
Most users do not need the whole catalog — a science question only touches a small patch of sky and a handful of columns. Two flags trim the download:
--cone RA,DEC,RADIUS_DEG — spatial subset¶
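Only partitions overlapping the cone are downloaded. Coordinates are ICRS, in degrees. For example:
acid download https://data.lsdb.io/hats/two_mass/two_mass /data/two_mass --cone 50,-50,2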
The resulting /data/two_mass/ is a valid HATS catalog whose
partition_info.csv and _metadata reflect only the downloaded
partitions; a point_map.fits is regenerated from the subset so the
analyzer's footprint pruning still works locally.
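The idea of keeping only partitions that overlap a cone can be illustrated with plain spherical geometry. This is a simplified stand-in: real HATS partitions are HEALPix pixels, and the page does not describe acid's actual overlap test. Here each partition is approximated by a hypothetical bounding circle (center plus radius).

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def may_overlap_cone(center_ra, center_dec, bounding_radius_deg, cone):
    """Conservative test: two circles overlap if their centers are close enough."""
    ra, dec, radius = cone
    separation = angular_separation_deg(center_ra, center_dec, ra, dec)
    return separation <= radius + bounding_radius_deg


# Parse the same RA,DEC,RADIUS_DEG triple that --cone takes.
cone = tuple(float(x) for x in "50,-50,2".split(","))
print(may_overlap_cone(50.0, -48.0, 1.0, cone))  # nearby partition
print(may_overlap_cone(50.0, -40.0, 1.0, cone))  # distant partition
```

A bounding-circle test can only over-select, never under-select, which is the safe direction for a download filter.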
--columns COL,COL,... — column subset¶
Only the named columns plus the HATS-required ones (RA, Dec,
_healpix_29) are pulled.
The savings are largest on wide catalogs: downloading three columns of interest from a Rubin object catalog with 1,250 columns pulls only a few percent of the bytes. The catalog still works for queries over those columns; queries referencing missing columns raise an analyze-time error pointing at the absent column.
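The effective column set is the union of what you asked for and what HATS requires. A minimal sketch, assuming the catalog's RA and Dec columns are called ra and dec (the real names vary by catalog) and using a hypothetical helper columns_to_fetch:

```python
def columns_to_fetch(requested, required):
    """Union of required and requested columns, order-preserving, no duplicates."""
    return list(dict.fromkeys([*required, *requested]))


# "ra" and "dec" are placeholder names; _healpix_29 is the HATS index column.
required = ["ra", "dec", "_healpix_29"]
print(columns_to_fetch(["phot_g_mean_mag", "ra"], required))
```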
--cone and --columns compose; specify both for a tight, fast
download of "just what I need".
Margin caches¶
By default, the catalog's margin cache (if it has one) is downloaded
alongside the main table. This is what you want — without the margin,
the catalog will be rejected as the right side of any XMATCH(...)
(see the margin caches guide).
To skip the margin (e.g. when you only plan to use the catalog as a
left-side anchor or only filter by columns, never crossmatch), pass
--skip-margin:
You can always build one later with acid hats build-margin /data/gaia_dr3
— see the margin caches guide.
Preview before downloading¶
If you're not sure how big the download will be (especially with a broad cone or many columns), preview first:
acid download https://data.lsdb.io/hats/two_mass/two_mass /data/two_mass \
--cone 50,-50,2 --estimate
--estimate prints the partition count and approximate byte count
and exits without writing anything. Combine it with --prefetch-metadata
to get the exact (rather than approximate) byte count. This fetches
the _metadata file, which can be hundreds of MB on wide catalogs,
so the trade is precision versus a potentially large extra metadata fetch.
For a one-time look at a remote catalog without committing to a
download, acid inspect reads only the metadata:
acid inspect https://data.lsdb.io/hats/two_mass/two_mass # summary
acid inspect schema https://data.lsdb.io/hats/two_mass/two_mass # column types
acid inspect properties https://data.lsdb.io/hats/two_mass/two_mass # raw properties
Parallelism, timeouts, and HTTPS verification¶
| Flag | Default | When to change |
|---|---|---|
| --workers N | 8 | Raise on a fast pipe with few parallel users; lower on shared / rate-limited servers. |
| --timeout SEC | 300 | HTTP only. Raise for very slow links; lower if you want failures to surface fast. |
| --insecure | off | HTTPS only. Set when the server uses self-signed certs (private mirrors, testing). Do not use against the public mirrors. |
| --tmpdir DIR | $TMPDIR | Set to fast local storage when DEST is on a networked filesystem; the point_map.fits mmap is built here. |
Downloads complete or fail loudly¶
This is a deliberate design property: acid download never produces a
half-downloaded HATS tree that looks structurally valid. A directory
with a dataset/ folder, some parquet files, and a partition_info.csv
looks like a catalog to the registry — and a half-populated one will
silently miss data when you query it. That failure mode is invisible
until you compare against the truth, which is precisely the bug an
astronomer cannot afford.
Concretely:
- Worker exceptions are not merely collected and logged: the first failure aborts the download and cancels queued transfers. The CLI exits non-zero. Partial files on disk are reusable on retry.
- Re-running the same acid download command after a hard failure skips files already on disk and resumes, so retry is cheap.
- If --prefetch-metadata was used, the _metadata file is the last thing written, so a successful run guarantees a consistent view.
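The fail-fast and resume behavior in the list above can be sketched with a thread pool. This illustrates the pattern only and is not acid's code: download_all and the fetch callback are hypothetical, file names are assumed to be flat, and writing through a .part file before renaming is an extra safeguard chosen for this sketch.

```python
from concurrent.futures import FIRST_EXCEPTION, ThreadPoolExecutor, wait
from pathlib import Path


def download_all(names, dest, fetch, workers=8):
    """Fetch every missing file; abort on the first failure.

    fetch(name) -> bytes is a caller-supplied transfer function.
    Returns the list of names that were attempted on this run.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # Resume: files already on disk from an earlier run are skipped.
    pending = [name for name in names if not (dest / name).exists()]

    def transfer(name):
        part = dest / (name + ".part")
        part.write_bytes(fetch(name))
        part.replace(dest / name)  # only completed files get the final name

    pool = ThreadPoolExecutor(max_workers=workers)
    futures = [pool.submit(transfer, name) for name in pending]
    try:
        done, _ = wait(futures, return_when=FIRST_EXCEPTION)
        for future in done:
            future.result()  # re-raises the first worker failure, if any
    finally:
        # Cancel transfers still queued; let in-flight ones finish (Python 3.9+).
        pool.shutdown(wait=True, cancel_futures=True)
    return pending
```

Calling the function again with the same arguments after a failure retries only the files that are still missing, which mirrors the cheap-retry behavior described above.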
If you need "best-effort, take what I can get" semantics, you'll have
to wrap acid download yourself and handle the per-file errors —
the CLI itself never offers that, because the rest of the docs would
have to caveat every step with "...unless your download was partial".
Adding the downloaded catalog to a Connection¶
The output of acid download is a valid HATS directory, so the
simplest pattern is to drop it in a directory that's already on your
connection's roots. A by-name download already does this — it lands
under your first writable ACID_PATH root — so it's discovered by a
plain directory walk:
then
acid.init("/data/hats") # the ACID_PATH root
cat = acid.open("two_mass") # discovered by directory walk
An explicit-URL download lands wherever you point its destination:
Or use a YAML / inline registry (see Catalogs and the registry) to name it explicitly.
Finding catalogs from Python¶
The same discovery is available without the CLI, through
acid.archives.search():
import acid
for cat in acid.archives.search("gaia"):
print(cat.name, cat.margins_arcsec, "shadowed" if cat.shadowed else "")
It reads the configured download path (it does not need a Connection),
returns one entry per occurrence across the roots, and respects the same
~1-hour remote cache (cache="refresh" re-crawls; cache="off"
bypasses). See the
API reference for the full
signature and the CatalogInfo fields.
See also¶
- CLI reference, acid search: the canonical flag list for discovery.
- Catalogs and the registry: how acid finds the downloaded catalog once it's on disk.
- Margin caches: what gets downloaded with --skip-margin off, and how to build one for a catalog whose margin cache is missing or too narrow.
- CLI reference, acid download: the canonical flag list.
- Troubleshooting: SSH retry behavior, _metadata size warnings, "I downloaded a catalog but the analyzer says it has no margin cache".