
Downloading catalogs

acid download fetches a HATS catalog (or a slice of one) from a remote source to local disk, in a HATS-valid layout you can hand straight to acid.init("./that/dir/"). It supports HTTP, SSH, and local-filesystem sources, with spatial- and column-subset modes for trimming the download to what you actually need.

This page covers the common cases:

  • Finding which catalogs are available before you download.
  • Fetching a full catalog by name, or by explicit URL.
  • Subsetting by cone (only download partitions overlapping a sky region) and by columns.
  • The hard guarantee that downloads either complete or fail loudly — no half-downloaded catalogs that look structurally valid.
  • Inspecting a remote catalog before committing to the download.

The full flag set with defaults is in the CLI reference; this page is the task-shaped overview.

Finding catalogs

Before downloading, list what's available with acid search:

acid search
gaia_dr3     margins: 10, 300 arcsec
two_mass     margins: 5 arcsec
wise/allwise margins: none
ztf_dr22     margins: 10 arcsec

Each line is one catalog you can hand straight to acid download. The name on the left is the download token; the margins: column lists the margin-cache radii, in arcseconds, available for that catalog.

Check the margins before you download

A margin cache is what makes boundary-crossing crossmatches correct, and you can only crossmatch against a catalog whose margin cache is at least as wide as your match radius (see the margin caches guide). The margins: column lets you confirm a wide-enough margin exists before committing to the download — margins: none means you'll have to build one yourself with acid hats build-margin afterward.
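The rule above reduces to a simple comparison. A minimal sketch, assuming the radii from the margins: column have already been parsed into numbers in arcseconds; smallest_sufficient_margin is a hypothetical helper for illustration, not part of acid:

```python
from typing import Optional, Sequence


def smallest_sufficient_margin(
    margins_arcsec: Sequence[float], match_radius_arcsec: float
) -> Optional[float]:
    """Return the narrowest margin that still covers the match radius, or None."""
    wide_enough = [m for m in margins_arcsec if m >= match_radius_arcsec]
    return min(wide_enough) if wide_enough else None


# Radii copied from the sample listing above; the 30 arcsec radius is illustrative.
print(smallest_sufficient_margin([10, 300], 30))  # gaia_dr3
print(smallest_sufficient_margin([5], 30))        # two_mass
print(smallest_sufficient_margin([], 30))         # wise/allwise (margins: none)
```

A None result corresponds to the case where you would need to build a wider margin yourself before crossmatching.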

acid search crawls every root on the download path over its native transport — local directories, ssh:// hosts, and http(s):// mirrors alike. Pass a substring to filter (case-insensitive):

acid search gaia        # only catalogs whose name contains "gaia"
acid search wise        # surfaces wise/allwise, wise/catwise, ...

Catalogs grouped under a namespace directory on the mirror surface as namespace/child (e.g. wise/allwise). That two-part name is exactly the token acid download expects — see Downloading by name below.

HATS collections are just catalogs here

A catalog published as a HATS collection (a directory bundling the primary table with its margin caches) shows up in acid search as a plain catalog name — the "collection" structure is read internally to find the margins but never surfaced. You download it by its name like any other.

Listings are cached

Remote (http / ssh) listings are cached under $XDG_CACHE_HOME/acid/downloads for about an hour, so a repeated acid search over a slow mirror returns instantly. Local roots are always crawled live. To force a fresh crawl:

acid search --cache refresh   # re-crawl, rewriting the cached listing
acid search --cache off       # bypass the cache entirely (read and write)
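The three cache behaviors can be pictured as a small time-to-live cache. This is only a sketch of the semantics described above, not acid's implementation: the mode name "on", the exact 3600-second lifetime, and the in-memory store dict are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

TTL_SECONDS = 3600  # "about an hour"; the exact lifetime is an assumption


def cached_listing(
    root: str,
    crawl: Callable[[str], List[str]],
    store: Dict[str, Tuple[float, List[str]]],
    mode: str = "on",
    now: Optional[float] = None,
) -> List[str]:
    """Return the listing for a root, honoring on / refresh / off semantics."""
    now = time.time() if now is None else now
    if mode == "off":
        return crawl(root)  # bypass entirely: neither read nor write the cache
    if mode == "on":
        hit = store.get(root)
        if hit is not None and now - hit[0] < TTL_SECONDS:
            return hit[1]  # fresh enough, no crawl needed
    elif mode != "refresh":
        raise ValueError(f"unknown cache mode: {mode!r}")
    listing = crawl(root)          # stale, missing, or refresh requested
    store[root] = (now, listing)   # rewrite the cached listing
    return listing
```

The point of the distinction is that refresh leaves a fresh entry behind for the next call, while off leaves whatever was cached untouched.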

On a terminal, acid search prints an aligned, colored table with a live spinner while it crawls. When piped, it switches to clean tab-separated output — name⇥margins⇥root⇥shadowed-marker — so it composes with the usual shell tools:

# Names only, of catalogs that acid download would actually fetch
# (the 4th column, the shadowed marker, is empty).
acid search | awk -F'\t' '$4 == "" { print $1 }'

The fourth column and the shadowed marker matter when your download path has more than one root — see Shadowing.
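If you would rather consume the piped output from Python than from awk, the four tab-separated columns map naturally onto a small record type. A sketch under assumptions: SearchRow and both helpers are hypothetical names, the margins field is kept as a raw string because its exact piped format is not specified here, and the sample text is made up.

```python
from typing import List, NamedTuple


class SearchRow(NamedTuple):
    name: str
    margins: str   # raw text of the margins column
    root: str
    shadowed: bool


def parse_search_tsv(text: str) -> List[SearchRow]:
    """Parse name<TAB>margins<TAB>root<TAB>shadowed-marker lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Pad to four fields in case a trailing empty column was trimmed.
        name, margins, root, marker = (line.split("\t") + [""] * 4)[:4]
        rows.append(SearchRow(name, margins, root, marker == "shadowed"))
    return rows


def fetchable_names(rows: List[SearchRow]) -> List[str]:
    """Names that a by-name download would actually fetch (not shadowed)."""
    return [r.name for r in rows if not r.shadowed]


# Hypothetical sample of piped output.
sample = (
    "gaia_dr3\t10, 300 arcsec\thttps://data.lsdb.io/hats/\t\n"
    "gaia_dr3\t10 arcsec\tssh://host/data\tshadowed\n"
)
print(fetchable_names(parse_search_tsv(sample)))
```

In practice the text would likely come from something like subprocess.run(["acid", "search"], capture_output=True, text=True).stdout, since captured output is a pipe rather than a terminal.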

The shortest download

Downloading by name

The simplest download is by name — whatever acid search listed, you can fetch:

acid download gaia_dr3

That resolves gaia_dr3 against the download path, fetches the full catalog (including its margin cache), and lands it under your local catalog path as <ACID_PATH>/gaia_dr3, re-openable by that bare name. A nested name lands under its leaf segment — the namespace is a remote locator, not part of the local identity:

acid download wise/allwise          # → <ACID_PATH>/allwise
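The leaf-segment rule is easy to mirror in a script that needs to predict where a by-name download will land. A small sketch; default_destination is a hypothetical helper, and it assumes a single catalog root rather than a multi-root ACID_PATH:

```python
from pathlib import Path, PurePosixPath


def default_destination(name: str, acid_path: str) -> Path:
    """Predict the local directory for a by-name download."""
    # The namespace ("wise/") only locates the catalog remotely;
    # the local identity is the last path segment.
    leaf = PurePosixPath(name).name
    return Path(acid_path) / leaf


print(default_destination("wise/allwise", "/data/hats"))
print(default_destination("gaia_dr3", "/data/hats"))
```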

You can name the destination explicitly as a second argument when you want it somewhere specific:

acid download gaia_dr3 /data/hats/gaia_dr3

Downloading by explicit URL or path

You can also point acid download straight at a source, bypassing name resolution. An explicit source is anything that is a URL or a filesystem path (a leading ./, /, or ~):

acid download https://data.lsdb.io/hats/two_mass/two_mass /data/two_mass

acid download accepts three explicit source kinds:

Source        Example
HTTP / HTTPS  https://data.lsdb.io/hats/two_mass/two_mass
SSH           user@server:/sdf/data/hats/gaia_dr3
Local path    /mnt/mirror/hats/two_mass

An explicit source requires an explicit destination — there's no catalog name to derive one from.

To copy from a local directory, prefix it with ./ or /

A bare token like gaia_dr3 is always treated as a name to look up on the download path, never as a local directory. To copy from a relative local directory, give it a leading ./ (./gaia_dr3) or an absolute path (/mnt/mirror/gaia_dr3) so acid download uses it verbatim instead of searching the mirror for a catalog called gaia_dr3.
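The name-versus-source distinction can be summarized as a small classifier. This is a sketch of the rules as described on this page, not acid's actual parser: classify_source is a hypothetical function, and both the user@host:path pattern and the acceptance of ssh:// URLs as explicit sources are assumptions.

```python
import re

# Assumed shape of an scp-style SSH source: user@host:/path
_SCP_STYLE = re.compile(r"^[^/\s:]+@[^/\s:]+:.+$")


def classify_source(token: str) -> str:
    """Return 'http', 'ssh', 'local', or 'name' for a download argument."""
    if token.startswith(("http://", "https://")):
        return "http"
    if token.startswith("ssh://") or _SCP_STYLE.match(token):
        return "ssh"
    if token.startswith(("./", "/", "~")):
        return "local"
    # Anything else, including namespace/child, is looked up on the download path.
    return "name"


for token in ["gaia_dr3", "./gaia_dr3", "wise/allwise",
              "user@server:/sdf/data/hats/gaia_dr3"]:
    print(token, "->", classify_source(token))
```

Note that under these rules a bare gaia_dr3 and ./gaia_dr3 classify differently even if a directory of that name exists in the current working directory.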

SSH uses the ssh subprocess, which means ~/.ssh/config (and its ProxyJump / aliased hosts) is honored. paramiko and fsspec do not honor ~/.ssh/config, so acid download deliberately shells out to ssh instead of using them.

The download path

When you download by name, the name is resolved against the download path — a first-wins list of roots, configurable via ACID_DOWNLOAD_PATH, an acid.conf download_path = line, or the built-in default. Out of the box that default is two roots, searched in order:

  1. https://data.lsdb.io/hats/ — the public LSDB HATS mirror.
  2. ssh://slacd/sdf/home/m/mjuric/datasets — the SLAC datasets dir.

The first root that has a catalog of the requested name wins.

When a name resolves against no root, the error is a per-root diagnostic rather than a bare "not found": each root on the download path is listed with its outcome — no match, unreachable: <reason>, or malformed collection — plus a did you mean '<closest>'? suggestion drawn from cached / local listings, and a pointer to acid search. A root that errored (an SSH host that's down, a connection refused, a 5xx from a mirror) is called out as unreachable rather than silently treated as missing — the catalog may exist there but couldn't be checked, so fix or re-try the connection before assuming it's gone. See Troubleshooting — Finding & resolving catalogs for the full walkthrough.
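First-wins resolution with a per-root diagnostic can be sketched in a few lines. This is an illustration only: resolve is a hypothetical function, each root's listing is modeled as either a collection of names or an Exception instance for an unreachable root, and continuing past an unreachable root to try later ones is an assumption about the behavior. The "did you mean" suggestion uses the standard library's difflib.

```python
import difflib
from typing import Iterable, List, Tuple, Union

Listing = Union[Iterable[str], Exception]


def resolve(name: str, roots: List[Tuple[str, Listing]]) -> str:
    """Return the first root holding `name`, or raise with per-root outcomes."""
    outcomes = []
    known = set()
    for root, listing in roots:
        if isinstance(listing, Exception):
            # Could not be checked: report it, do not treat it as "missing".
            outcomes.append(f"{root}: unreachable: {listing}")
            continue
        names = set(listing)
        if name in names:
            return root  # first match wins
        known |= names
        outcomes.append(f"{root}: no match")
    close = difflib.get_close_matches(name, sorted(known), n=1)
    if close:
        outcomes.append(f"did you mean '{close[0]}'?")
    raise LookupError(f"cannot resolve {name!r}\n" + "\n".join(outcomes))
```

Keeping "unreachable" distinct from "no match" is what lets the caller decide whether a retry is worthwhile.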

When roots shadow each other

When more than one root holds a catalog of the same name, acid search shows every occurrence but flags the later ones — acid download resolves first-wins, so a later copy is shadowed: it would never be the one fetched by name. On a terminal the shadowed row carries a trailing *; when piped, the fourth TSV column reads shadowed. The summary line reports the count, e.g. ✓ 12 catalogs (2 shadowed), with a footnote explaining the marker. To actually fetch a shadowed copy, download it by its explicit URL or path instead of by name.
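Shadowing follows directly from the first-wins rule: an occurrence is shadowed when an earlier root already listed the same name. A minimal sketch with a hypothetical mark_shadowed helper and made-up root URLs:

```python
from typing import Iterable, List, Tuple


def mark_shadowed(
    roots: List[Tuple[str, Iterable[str]]]
) -> List[Tuple[str, str, bool]]:
    """Return (name, root, shadowed) for every occurrence, in root order."""
    seen = set()
    rows = []
    for root, names in roots:
        names = sorted(names)
        for name in names:
            # Shadowed if an earlier root already provides this name.
            rows.append((name, root, name in seen))
        seen.update(names)
    return rows


rows = mark_shadowed([
    ("https://mirror.example/hats/", ["gaia_dr3", "two_mass"]),
    ("ssh://host/data", ["gaia_dr3", "ztf_dr22"]),
])
for name, root, shadowed in rows:
    print(name, root, "shadowed" if shadowed else "")
```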

After downloading

Once a download finishes, the destination is itself a valid HATS catalog directory — point acid at it and you can query immediately:

import acid

acid.init("/data/hats")          # the ACID_PATH root the download landed under
print([c.name for c in acid.list_catalogs()])
cat = acid.open("gaia_dr3")
print(cat.head(10).to_astropy())

Subset downloads

Most users do not need the whole catalog — a science question only touches a small patch of sky and a handful of columns. Two flags trim the download:

--cone RA,DEC,RADIUS_DEG — spatial subset

Only partitions overlapping the cone are downloaded. ICRS, degrees.

acid download https://data.lsdb.io/hats/two_mass/two_mass /data/two_mass \
    --cone 50,-50,2

The resulting /data/two_mass/ is a valid HATS catalog whose partition_info.csv and _metadata reflect only the downloaded partitions; a point_map.fits is regenerated from the subset so the analyzer's footprint pruning still works locally.

--columns COL,COL,... — column subset

Only the named columns plus the HATS-required ones (RA, Dec, _healpix_29) are pulled.

acid download user@server:/hats/gaia /data/gaia \
    --columns ra,dec,phot_g_mean_mag,parallax

This is dramatic on wide catalogs: downloading just three columns of interest from a Rubin object catalog with 1,250 columns pulls a few percent of the bytes. The catalog still works for queries over those columns; queries referencing missing columns raise an analyze-time error pointing at the absent column.

--cone and --columns compose; specify both for a tight, fast download of "just what I need".
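When driving subset downloads from a script, it helps to assemble the argument list programmatically instead of formatting a shell string. A sketch that uses only the flags documented above; build_download_argv is a hypothetical helper, the range checks are assumptions about what counts as a sensible cone, and the column names are illustrative.

```python
import shlex
from typing import List, Optional, Sequence, Tuple


def build_download_argv(
    source: str,
    dest: str,
    cone: Optional[Tuple[float, float, float]] = None,
    columns: Optional[Sequence[str]] = None,
) -> List[str]:
    """Assemble an `acid download` command line with optional subsetting."""
    argv = ["acid", "download", source, dest]
    if cone is not None:
        ra, dec, radius_deg = cone
        if not -90 <= dec <= 90 or radius_deg <= 0:
            raise ValueError("cone needs -90 <= DEC <= 90 and RADIUS_DEG > 0")
        argv += ["--cone", f"{ra},{dec},{radius_deg}"]  # RA,DEC,RADIUS_DEG
    if columns:
        argv += ["--columns", ",".join(columns)]
    return argv


argv = build_download_argv(
    "https://data.lsdb.io/hats/two_mass/two_mass",
    "/data/two_mass",
    cone=(50, -50, 2),
    columns=["ra", "dec"],
)
print(shlex.join(argv))
```

Passing the list to subprocess.run(argv, check=True) would raise CalledProcessError on a non-zero exit, which fits the fail-loudly behavior described later on this page.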

Margin caches

By default, the catalog's margin cache (if it has one) is downloaded alongside the main table. This is what you want — without the margin, the catalog will be rejected as the right side of any XMATCH(...) (see the margin caches guide).

To skip the margin (e.g. when you only plan to use the catalog as a left-side anchor or only filter by columns, never crossmatch), pass --skip-margin:

acid download https://data.lsdb.io/hats/gaia_dr3 /data/gaia_dr3 \
    --skip-margin

You can always build one later with acid hats build-margin /data/gaia_dr3 — see the margin caches guide.

Preview before downloading

If you're not sure how big the download will be (especially with a broad cone or many columns), preview first:

acid download https://data.lsdb.io/hats/two_mass/two_mass /data/two_mass \
    --cone 50,-50,2 --estimate

--estimate prints the partition count and approximate byte count and exits without writing anything. Combine with --prefetch-metadata to force the exact (rather than approximated) byte count. This fetches the _metadata file, which can be hundreds of MB on wide catalogs, so the trade is precision versus a potentially large extra metadata fetch.

For a one-time look at a remote catalog (without committing to download), acid inspect reads only the metadata:

acid inspect https://data.lsdb.io/hats/two_mass/two_mass            # summary
acid inspect schema https://data.lsdb.io/hats/two_mass/two_mass      # column types
acid inspect properties https://data.lsdb.io/hats/two_mass/two_mass  # raw properties

Parallelism, timeouts, and HTTPS verification

Flag           Default  When to change
--workers N    8        Raise on a fast pipe with few parallel users; lower on shared / rate-limited servers.
--timeout SEC  300      HTTP only. Raise for very slow links; lower if you want failures to surface fast.
--insecure     off      HTTPS only. Set when the server uses self-signed certs (private mirrors, testing). Do not use against the public mirrors.
--tmpdir DIR   $TMPDIR  Set to fast local storage when DEST is on a networked filesystem; the point_map.fits mmap is built here.

Downloads complete or fail loudly

This is a deliberate design property: acid download never produces a half-downloaded HATS tree that looks structurally valid. A directory with a dataset/ folder, some parquet files, and a partition_info.csv looks like a catalog to the registry — and a half-populated one will silently miss data when you query it. That failure mode is invisible until you compare against the truth, which is precisely the bug an astronomer cannot afford.

Concretely:

  • Worker exceptions are never collected and merely logged: the first failure aborts the download and cancels queued transfers, and the CLI exits non-zero. Files already on disk are reused on retry.
  • Re-running the same acid download command after a hard failure skips files already on disk and resumes — retry is cheap.
  • If --prefetch-metadata was used, the _metadata file is the last thing written, so a successful run guarantees a consistent view.
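The fail-fast and resume behavior listed above is a common pattern with a thread pool. The following is a sketch of that pattern, not acid's implementation: download_all and the fetch callback are hypothetical, file names are assumed to be flat, and the write-to-.part-then-rename step is an assumption about how to keep truncated files from appearing under their final name.

```python
from concurrent.futures import FIRST_EXCEPTION, ThreadPoolExecutor, wait
from pathlib import Path
from typing import Callable, Iterable, List


def download_all(
    names: Iterable[str],
    dest: Path,
    fetch: Callable[[str], bytes],
    workers: int = 8,
) -> List[str]:
    """Fetch every missing file; abort on the first failure."""
    dest.mkdir(parents=True, exist_ok=True)
    # Resume: skip anything that already exists under its final name.
    todo = [n for n in names if not (dest / n).exists()]

    def fetch_one(name: str) -> str:
        data = fetch(name)
        tmp = dest / (name + ".part")
        tmp.write_bytes(data)
        tmp.replace(dest / name)  # rename only after a complete write
        return name

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch_one, n) for n in todo]
        done, pending = wait(futures, return_when=FIRST_EXCEPTION)
        for future in pending:
            future.cancel()   # drop transfers that have not started yet
        for future in done:
            future.result()   # re-raises the worker's exception, if any
    return todo
```

A caller would let the exception propagate and exit non-zero; re-running the same call afterwards only fetches what is still missing.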

If you need "best-effort, take what I can get" semantics, you'll have to wrap acid download yourself and handle the per-file errors — the CLI itself never offers that, because the rest of the docs would have to caveat every step with "...unless your download was partial".

Adding the downloaded catalog to a Connection

The output of acid download is a valid HATS directory, so the simplest pattern is to drop it in a directory that's already on your connection's roots. A by-name download already does this — it lands under your first writable ACID_PATH root — so it's discovered by a plain directory walk:

acid download two_mass            # → <ACID_PATH>/two_mass

then

acid.init("/data/hats")            # the ACID_PATH root
cat = acid.open("two_mass")        # discovered by directory walk

An explicit-URL download lands wherever you point its destination:

acid download https://data.lsdb.io/hats/two_mass/two_mass /data/hats/two_mass

Or use a YAML / inline registry (see Catalogs and the registry) to name it explicitly.

Finding catalogs from Python

The same discovery is available without the CLI, through acid.archives.search():

import acid

for cat in acid.archives.search("gaia"):
    print(cat.name, cat.margins_arcsec, "shadowed" if cat.shadowed else "")

It reads the configured download path (it does not need a Connection), returns one entry per occurrence across the roots, and respects the same ~1-hour remote cache (cache="refresh" re-crawls; cache="off" bypasses). See the API reference for the full signature and the CatalogInfo fields.

See also