Changelog
All notable changes to acid will be recorded here. The format
follows Keep a Changelog; the
project follows Semantic Versioning within the
constraints of its alpha posture (see CLAUDE.md).
[0.5.0a1] - 2026-06-15
Added
- acid list: the local twin of acid search. Lists the catalogs in the catalog path ($ACID_PATH → config path → built-in default ~/datasets), the ones acid open / acid query resolve a bare name against, instead of the download path. Same discovery engine, transports (local / ssh:// / http(s)://), flags (PATTERN, --cache, --timeout, --insecure, --no-color), and output contract as acid search (aligned table on a TTY, name⇥margins⇥root⇥marker TSV when piped, margin radii, namespace/child names, shadowing); the only divergence is the shadowing footnote, which names acid open (which resolves the first match) rather than acid download. So acid search answers "what can I download?" and acid list answers "what's already here that I can open?".
Changed
- acid.list_catalogs() / Connection.list_catalogs() now return the same results as the acid list CLI. Previously they did a shallow local-only os.listdir walk that returned a list[str] of basenames, which silently dropped ssh:// / http(s):// roots in ACID_PATH, missed catalogs under namespace directories, and listed margin-cache siblings (e.g. object_2arcsec) as if they were catalogs. They now go through the same discovery engine as acid list and return a list[acid.tools.download.CatalogInfo], i.e. rows of (name, margins_arcsec, root, shadowed), over local / ssh / http roots, with namespace/child names, margin-cache siblings attributed to their parent, and cross-root shadowing flagged. Explicitly registered catalogs (YAML / register_catalog) are still included (a superset of the CLI). This is a breaking change: callers using the name-membership idiom ("gaia" in acid.list_catalogs()) must test against the names instead ("gaia" in {c.name for c in acid.list_catalogs()}).
- Renamed the catalog-discovery row type FoundCatalog → CatalogInfo. It's now returned by both acid list / acid.list_catalogs() (catalog path) and acid search / acid.archives.search() (download path), so the download-flavored FoundCatalog name no longer fit. Same fields (name, margins_arcsec, root, shadowed); import as from acid.tools.download import CatalogInfo.
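The membership migration described above can be sketched without acid installed. The CatalogInfo class below is a stand-in that only copies the four field names from this entry; the field types, the example roots, and the catalog_names helper are assumptions made for illustration.

```python
from typing import List, NamedTuple, Set, Tuple


class CatalogInfo(NamedTuple):
    """Stand-in row type with the four documented fields (types are assumed)."""
    name: str
    margins_arcsec: Tuple[float, ...]
    root: str
    shadowed: bool


def catalog_names(rows: List[CatalogInfo]) -> Set[str]:
    """Hypothetical helper: collapse discovery rows to a set of names."""
    return {c.name for c in rows}


# Made-up rows standing in for what list_catalogs() might return.
rows = [
    CatalogInfo("gaia", (10.0, 300.0), "/home/user/datasets", False),
    CatalogInfo("wise/allwise", (), "https://example.org/hats", False),
    CatalogInfo("gaia", (10.0,), "ssh://example-host/datasets", True),
]

# The old idiom compares a str against row tuples, so it is never true.
old_idiom_matches = "gaia" in rows

# The new idiom tests membership against the names.
names = catalog_names(rows)
new_idiom_matches = "gaia" in names

# Rows flagged as shadowed can be filtered out when only the winner matters.
winners = [c for c in rows if not c.shadowed]
```

Because the old idiom fails silently rather than raising, code that gated on it would quietly take the "catalog missing" branch after the upgrade.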
[0.4.0a4] - 2026-06-14
Fixed
- acid search no longer blocks or hangs on an unreachable download root. The default download path includes an ssh:// root, so a user who hasn't set up that host (e.g. slacd not in their ~/.ssh/config) previously had acid search (and acid.archives.search()) either sleep ~19 s through the download path's 6-attempt MaxStartups backoff or, for a resolvable but firewalled host, hang indefinitely (no connect timeout). In both cases the failure aborted the whole search, discarding catalogs already found at the other roots. Now:
  - ssh runs with BatchMode=yes (fail instead of blocking on a password / passphrase prompt) and ConnectTimeout (bound the TCP connect), so a dead host fails in seconds, not minutes;
  - discovery probes use a single attempt (retries=1) instead of the download hot loop's backoff schedule;
  - a download path is treated as a search path: a root that can't be crawled is skipped (reported via search_downloads(on_error=…) / a UserWarning, and as a ! skipped <root> line by the CLI), never fatal, so the other roots' catalogs still come back;
  - connection readiness is now a positive handshake rather than a 150 ms timing heuristic: the remote server emits a one-byte ready marker once ssh has authenticated and the remote Python has started, and SshSession blocks on it. A failed connect / auth / host-key check closes the pipe (EOF), so the failure surfaces at connect time (__init__) instead of being deferred to the first later read. This also fixes a latent misclassification where a remote key rejection was reported as "catalog not found" (absent) instead of "unreachable" (error) by SshFetcher.fetch_text_ex, and makes the retry/backoff actually apply to auth/slow-kex failures (the old heuristic returned "connected" before they happened);
  - the skip is classified into an actionable hint (ssh_failure_hint): a rejected key prints → the host rejected authentication — set up key-based SSH access (ssh-copy-id <host> / ssh-add), with distinct hints for an untrusted host key, an unresolvable name, a connect timeout, and a refused connection. The hint is derived from the ssh stderr (every ssh failure exits 255), and is detected at both the connect (__init__) and first-read (scan_dir) sites, since a remote auth rejection surfaces only after the connection looks established.
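As a rough illustration of the first and last points above, the sketch below builds a non-interactive ssh probe command and maps common OpenSSH stderr fragments to hints. It is not acid's ssh_failure_hint: the function names, the hint wording, and the 10-second default are assumptions, and only the BatchMode / ConnectTimeout options and the stderr fragments come from OpenSSH itself.

```python
from typing import List, Tuple


def ssh_probe_argv(host: str, connect_timeout_s: int = 10) -> List[str]:
    """Hypothetical helper: an ssh command that fails fast instead of prompting."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                         # never prompt for a password
        "-o", f"ConnectTimeout={connect_timeout_s}",   # bound the TCP connect
        host,
        "true",                                        # trivial remote command
    ]


# (stderr fragment, hint) pairs, checked in order. Hint texts are illustrative.
_HINTS: List[Tuple[str, str]] = [
    ("permission denied", "host rejected authentication: set up key-based SSH access"),
    ("host key verification failed", "host key is not trusted: review known_hosts"),
    ("could not resolve hostname", "host name does not resolve: check ~/.ssh/config"),
    ("timed out", "connect timed out: host may be firewalled or down"),
    ("connection refused", "connection refused: no SSH server on that port"),
]


def classify_ssh_stderr(stderr: str) -> str:
    """Every ssh failure exits 255, so the reason has to come from stderr."""
    lowered = stderr.lower()
    for fragment, hint in _HINTS:
        if fragment in lowered:
            return hint
    return "ssh failed for an unrecognized reason"


argv = ssh_probe_argv("example-host")
hint_auth = classify_ssh_stderr("user@example-host: Permission denied (publickey).")
hint_dns = classify_ssh_stderr("ssh: Could not resolve hostname nope: Name or service not known")
hint_refused = classify_ssh_stderr("ssh: connect to host h port 22: Connection refused")
```

Matching on stderr text is brittle across OpenSSH versions and locales, so a fallback branch like the last return is worth keeping in any real classifier.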
[0.4.0a3] - 2026-06-14
Changed
- The HATS spatial index (_healpix_29) is no longer surfaced in query output. It's an internal HATS format detail, not data the user asked for, so it no longer appears in SELECT * / a.*, Catalog.columns / Result.column_names, or any materialized output (to_polars / to_arrow / to_astropy / to_pandas / print / show):
  - the root catalog's index is hidden but kept physically: an explicit select("…, _healpix_29") / WHERE _healpix_29 … resolves it, save / --output hats still write it to disk (the index is required for a valid, re-queryable catalog), and acid inspect still reveals it in the on-disk schema;
  - a joined right table's index (_healpix_29_<alias>) is dropped entirely after a join/crossmatch: it's meaningless in the result (which is partitioned by the root's index) and is neither selectable nor written.
- print(result) now renders the result as a Polars DataFrame (via Result.__str__), with its shape: (rows, cols) header plus Polars's own head/tail row truncation, instead of the previous fixed-width first-20-rows ASCII table. Result.show(n) (and the acid query CLI) keep the fixed-width renderer. Note that print materializes the full result; use show(n) for a bounded peek at a large on-disk result.
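The "hidden but kept physically" rule for the root index can be pictured with a small pure-Python sketch. This is not acid's projection code: both function names are hypothetical, and a plain list of column names stands in for a catalog schema.

```python
from typing import List

HIDDEN_INDEX = "_healpix_29"  # the column name documented above


def default_columns(schema: List[str]) -> List[str]:
    """What a SELECT * style listing would show: everything but the index."""
    return [c for c in schema if c != HIDDEN_INDEX]


def resolve_select(schema: List[str], requested: List[str]) -> List[str]:
    """An explicit request still resolves the hidden column if it exists on disk."""
    missing = [c for c in requested if c not in schema]
    if missing:
        raise KeyError(f"unknown columns: {missing}")
    return list(requested)


on_disk = ["source_id", "ra", "dec", "_healpix_29"]
star = default_columns(on_disk)
explicit = resolve_select(on_disk, ["source_id", "_healpix_29"])
```

The point of the split is that hiding is a display default only; the column never leaves the physical schema.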
[0.4.0a2] - 2026-06-14
Removed
- Result.save() is gone (breaking, no shim). A Result is already-materialized data (it has left the partitioned system), and its in-memory branch could only write a degenerate single-partition catalog (all rows mislabelled into pixel (0,0), no spatial index). HATS output is a stays-in-the-system operation written from the lazy handle that can stream it correctly: Catalog.save(path, name=...), or acid.sql.query(query, output="dir/") / acid query --output dir/ for the SQL surface. Result keeps export(path) (one flat file) and the to_* converters. This sharpens the terminal split: save is Catalog-only, and Result only leaves the system (a deliberate amendment to the API-DESIGN V4/V5 Catalog/Result symmetry).
Fixed
- A saved catalog now always keeps its HEALPix spatial index, even when a projection dropped it. db.open("x").select("source_id, designation").save(...) used to write a partition-only catalog with no _healpix_29 column; re-opening it and running a query under many workers crashed (TypeError: unsupported operand type(s) for -: 'NoneType' and 'int') when the work-tuple planner tried to sub-partition a catalog it couldn't row-restrict. The HATS spatial index is now treated as a required column of a HATS catalog: a query whose result stays in the system (Catalog.save / acid query --output hats) retains _healpix_29 in its terminal projection regardless of the select, so the written catalog stays valid and re-queryable. The leaving-the-system terminals (export, to_polars / to_arrow / to_astropy / to_pandas, display) are unchanged: a flat extract omits the index unless you select it explicitly.
- Catalog.save / acid query --output hats now declare hats_col_healpix / hats_col_healpix_order in the output catalog's properties, so re-registration is HATS-spec-compliant and no longer relies on the reader's _healpix_<order> column-name auto-detection.
- The catalog registry no longer stores the literal string 'None' for hpix_col when a catalog has no HEALPix column (it now stores None).
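The last fix is easy to underestimate, so here is a minimal sketch of why a stored 'None' string is harmful. How the string got into the registry is not stated in this entry; the str() coercion below is only one plausible cause, and both function names are hypothetical.

```python
from typing import Optional


def hpix_col_to_store_buggy(col: Optional[str]) -> str:
    """Assumed failure mode: coercing to str turns a missing value into 'None'."""
    return str(col)


def hpix_col_to_store(col: Optional[str]) -> Optional[str]:
    """Keep a missing value missing."""
    return col


stored_buggy = hpix_col_to_store_buggy(None)
stored_fixed = hpix_col_to_store(None)

# A reader that checks truthiness now believes a column named 'None' exists.
buggy_looks_present = bool(stored_buggy)
fixed_looks_present = bool(stored_fixed)
```

Any downstream check of the form "if hpix_col:" takes the wrong branch on the buggy value, which is why storing a real None matters.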
[0.4.0a1] - 2026-06-13
Changed
- Result now mirrors Catalog's terminal verbs (API-DESIGN V4/V5: one concept, one name, on both nouns; breaking, no shims): Result.arrow() → Result.to_arrow(); Result.write(path, format=) → Result.export(path, format=) (same contract as Catalog.export, including the ValidationError-pointing-at-save destination checks); Result.write_parquet(layout="hats") → Result.save(path) (a HATS tree, without Catalog.save's name registration). The per-format writers (write_parquet(layout="single") / write_csv / write_fits) and the df() pandas alias are removed; use export(path) (format by extension) and to_pandas().
- add_catalog → register_catalog (Connection method + module-level delegate): one prefix for "make a name known to the connection" (register_catalog / register_file / register_moc). acid.register_file gained the missing module-level delegate (API-DESIGN O3).
- agg.bool_and / agg.bool_or → agg.all / agg.any, the numpy idiom, like mean / std / var (API-DESIGN A1). SQL-string BOOL_AND(...) / BOOL_OR(...) in db.sql are unchanged.
- acid query --out → --output: one flag spelling for one concept across subcommands, mirroring the Connection.sql(output=) kwarg (API-DESIGN P2). The abbreviation --out still parses (argparse prefix matching).
- Connection(workers=None) is the new default spelling of "resolve for me" (was workers="auto"; "auto" is still accepted). None is the one inherit sentinel across the API (API-DESIGN S2).
- Result.to_arrow() no longer imports DuckDB for disk-backed results: the partition union now streams through the same manifest-driven PyArrow path as Result.batches(). DuckDB is again strictly a test-only dependency ("one engine").
- acid search / acid.archives.search cache control is one keyword: cache="use"|"refresh"|"off" (CLI --cache) replaces the overlapping refresh= / use_cache= boolean pair and the --refresh / --no-cache flags; archives.search also gained the timeout= / insecure= / workers= crawl knobs the CLI already exposed (API-DESIGN P3: no CLI-only capability).
- build_margin_cache now fails loudly on the first partition failure (parallel scan): the run aborts with the failing partition named instead of printing per-partition FAILED lines and exiting 0 over an incomplete margin cache (API-DESIGN E1/E5).
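The note that --out still parses follows from argparse's default prefix matching (allow_abbrev=True). The parser below is a stand-in, not acid's real CLI definition; only the --output and --cache spellings and the three cache choices are taken from the entries above.

```python
import argparse

# Stand-in parser; the real acid CLI defines many more options.
parser = argparse.ArgumentParser(prog="acid-query-sketch")
parser.add_argument("--output")
parser.add_argument("--cache", choices=["use", "refresh", "off"], default="use")

# "--out" is an unambiguous prefix of "--output", so argparse accepts it.
abbreviated = parser.parse_args(["--out", "dir/"])
spelled_out = parser.parse_args(["--output", "dir/", "--cache", "refresh"])
```

Prefix matching stops working as soon as a second option shares the prefix, so an abbreviation like this should be treated as a courtesy rather than a stable spelling.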
Added
- Work-tuple subdivision gate (autosize, decision #20): a query over a handful of partitions (e.g. a small cone) used to coalesce to a handful of work tuples and leave most workers idle, because the parallelism floor is clamped to physical-partition granularity. When the layout yields fewer than workers/2 tuples, the driver now re-enumerates with the floor allowed below partition granularity, aiming for ~workers tuples so idle cores get fed. Fires only in the under-subscribed regime (the billion-row/thousands-of-partitions path is byte-for-byte unchanged); the cost is duplicated central+margin reads per sub-cell (the OS page cache keeps the physical I/O ~1×).
- acid hats build-margin --tmpdir (and build_margin_cache(tmpdir=...)): redirect the accumulator's spill scratch to fast local storage when --output lives on a slow networked filesystem (a unique subdir is created and removed; the base must already exist, fail-loud).
- API-DESIGN.md: the API design language (new root authoritative doc). The prescriptive principles governing the public Python surface (the two-noun object model, the composition / materialization / introspection verb taxonomy, fail-loud error rules, astronomer idiom + explicit units, signature and config-knob conventions) plus the CLI-as-projection corollary. Normative, not descriptive: new or changed public API must pass its §13 checklist (citable rule IDs) or amend the rule in the same change. Added to CLAUDE.md's trusted-docs list.
- acid search [pattern]: discover catalogs available to download. Enumerates the catalogs under the download path over every transport (local / ssh:// / http(s)://) and prints one line per catalog with the margin-cache radii available for it (gaia_dr3 margins: 10, 300 arcsec). Catalogs nested under namespace directories surface as namespace/child (wise/allwise); HATS collections are presented as plain catalogs (the "collection" concept is never shown). Roots are merged in download-resolution order: a same-named catalog at a later root is still listed but flagged (a trailing * / a shadowed TSV column), since acid download resolves first-wins. Output is an aligned, colored table on a TTY (with a live spinner while crawling) and parseable TSV when piped. Remote listings are cached under $XDG_CACHE_HOME/acid/downloads for ~1h (--refresh re-crawls, --no-cache bypasses); local roots are always live. The Python sibling is acid.archives.search(pattern=None, *, cache="use", timeout=300.0, insecure=False, workers=16) → list[FoundCatalog] (the same knobs as the CLI flags; cache is "use" / "refresh" / "off").
- acid download accepts nested namespace/child names. A name like wise/allwise (as shown by acid search) now resolves against the download path and lands locally under its leaf (<ACID_PATH>/allwise). Only a path with a leading ./, /, or ~, or a URL, is treated as an explicit source now (an internal / no longer forces verbatim use; prefix ./ to copy a local relative dir).
- The built-in default download path gained a second root. It is now https://data.lsdb.io/hats/ then ssh://slacd/sdf/home/m/mjuric/datasets, searched first-wins. (default_download_path() now returns list[str].)
- ACID_QUERY_LOG=<file.csv>: a per-tuple query execution log (debug aid; acid/engine/querylog.py). When set, every executed work tuple appends one CSV row (timestamp,worker_id,norder,npix,nrows), with the originating query written as #-prefixed comment lines before the header (SQL verbatim; the fluent path logs a rendered pipeline summary, now also attached to OpPlan.query so fluent error messages and manifests carry a description). The file is truncated per query and written only by the parent process (workers stamp worker_id / exec_time onto the PartitionResult they already return; the single-threaded result collector writes the row), so it is correct on local disk, NFS, and Weka alike, with no concurrent-O_APPEND race. Read once at import; a no-op when unset. In-process (workers<=1) rows carry worker_id=-1.
- Catalog.export(path, *, format=None, progress=None) -> Path: a flat-file terminal verb, the leaves-the-system counterpart to save's stays-in-the-system HATS write (spec provenance docs/archive/EXPORT-API.md). Sugar over execute().write(...): it gathers the full result in RAM and writes one CSV / parquet / FITS file (format by extension or format=), returning the written Path. A no-extension / unknown-extension / format="hats" call is a ValidationError pointing at save; export never writes HATS.
- save() now joins the catalog library for a bare name. A bare destination (no /, e.g. save("gxt")) lands under the first writable local ACID_PATH root, so the name is durably re-openable in a later session (acid.open("gxt") / ... FROM gxt) with no path bookkeeping, the same model acid download <name> uses. Explicit paths (./x, /abs/x, ~/x) stay verbatim/cwd-relative. A bare name shadowed by an existing catalog earlier on ACID_PATH is a hard RegistryError (overwrite=True does not override it). acid query --out <bare name> (hats) shares the rule. The destination machinery lives in acid/api/_dest.py (shared by Catalog.save, acid query, and acid download).
- RAM-budget work-tuple sizing (autosize): the new default enumeration (key decision #20; spec provenance docs/archive/WORK-AUTOSIZE.md). Work tuples are no longer dictated by the on-disk HATS layout: a byte-budget quadtree over point_map.fits row-count maps (engine/autosize.py) picks the coarsest HEALPix cells whose estimated resident bytes fit ram_budget / workers, coalescing small physical partitions into one tuple (concatenated scans) and splitting oversized ones (cursor row filters), with count-based pruning (cells without root rows or INNER partners within the locality band emit nothing) and a two-level threshold (the parallelism floor never splits below physical partition granularity; only the RAM ceiling does, which is the OOM protection). Verified result-identical to the legacy enumeration across a 10-shape oracle sweep at three granularities and 23 exact-equality real-data validation tests; the test suite's wall clock halved under it.
- ram_budget: the one new knob, available as an acid.conf key, ACID_RAM_BUDGET, acid.init(ram_budget="64GB"), and CLI --ram-budget. Bytes or human sizes (64GB, 512MiB, 32g); default 0.25 × available RAM (cgroup-aware).
- ACID_WORK_AUTOSIZE=0 selects the legacy layout-driven enumeration (A/B + verification switch; slated for deletion once autosize is validated at scale).
- Every acid HATS output now writes point_map.fits (exact counts when the healpix column survives the projection; partition-painted counts otherwise), so outputs can always be re-registered and queried.
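The byte-budget quadtree behind autosize can be sketched in a few lines. This is an illustration under stated assumptions, not the code in engine/autosize.py: it uses a dict of row counts at one fixed HEALPix order in place of point_map.fits, a single hypothetical bytes_per_row estimate, and it omits the parallelism floor and the locality-band pruning. The only HEALPix fact relied on is that, in NESTED numbering, cell p at order k has children 4p to 4p+3 at order k+1.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (order, pixel) in HEALPix NESTED numbering


def rows_in_cell(counts: Dict[int, int], max_order: int, order: int, pix: int) -> int:
    """Sum the row counts (given at max_order) that fall inside cell (order, pix)."""
    shift = 2 * (max_order - order)
    lo, hi = pix << shift, (pix + 1) << shift
    return sum(n for p, n in counts.items() if lo <= p < hi)


def pick_cells(counts: Dict[int, int], max_order: int,
               bytes_per_row: int, budget_bytes: int) -> List[Cell]:
    """Pick the coarsest cells whose estimated bytes fit the per-worker budget."""
    chosen: List[Cell] = []

    def visit(order: int, pix: int) -> None:
        rows = rows_in_cell(counts, max_order, order, pix)
        if rows == 0:
            return  # count-based pruning: empty cells emit nothing
        if rows * bytes_per_row <= budget_bytes or order == max_order:
            chosen.append((order, pix))  # fits, or cannot be split further here
            return
        for child in range(4 * pix, 4 * pix + 4):
            visit(order + 1, child)

    for base_pixel in range(12):  # the 12 order-0 HEALPix cells
        visit(0, base_pixel)
    return chosen


# Made-up counts at order 2; budget_bytes plays the role of ram_budget / workers.
counts = {0: 100, 1: 100, 5: 100, 20: 10}
cells = pick_cells(counts, max_order=2, bytes_per_row=1, budget_bytes=150)
```

In this toy input the crowded region is split down to order 2 while the sparse base pixel stays a single order-0 cell, which is the coalesce-or-split behaviour the entry describes.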
Fixed
- A catalog-resolution probe error no longer masquerades as "not found."
Resolving a bare catalog name probes each download/catalog-path root; a probe
that errored (an unreachable SSH host, a connection timeout, a 5xx) was
previously swallowed and indistinguishable from "the catalog isn't here," so a
catalog that genuinely lives on a momentarily-unreachable root could be
reported as a flat "not found." The fetcher probe is now tri-state
(fetch_text_ex → FetchResult(text, error)): cleanly absent (404/403/410, missing local file, connected-but-no-file SSH) is distinguished from a transport failure, and resolution reports the per-root outcome instead of flattening an unreachable root to "absent."
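A tri-state probe result of this shape can be sketched as follows. The FetchResult name and its (text, error) fields come from the entry above; the classify_http and outcome helpers, and the choice to treat every other status as a transport error, are assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class FetchResult(NamedTuple):
    """Stand-in with the two documented fields."""
    text: Optional[str]   # file contents when found
    error: Optional[str]  # reason when the root could not be probed


ABSENT_STATUSES = {403, 404, 410}  # the "cleanly absent" codes listed above


def classify_http(status: int, body: str = "") -> FetchResult:
    """Hypothetical mapping from an HTTP status to the tri-state result."""
    if 200 <= status < 300:
        return FetchResult(text=body, error=None)
    if status in ABSENT_STATUSES:
        return FetchResult(text=None, error=None)
    return FetchResult(text=None, error=f"HTTP {status}")


def outcome(result: FetchResult) -> str:
    """Collapse a result to the per-root outcome a resolver would report."""
    if result.error is not None:
        return "unreachable"
    return "found" if result.text is not None else "absent"


found = outcome(classify_http(200, "properties"))
absent = outcome(classify_http(404))
unreachable = outcome(classify_http(503))
```

Keeping error separate from text is what lets a resolver say "this root was down" instead of "the catalog isn't here".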
Changed
- Friendlier "catalog not found" errors across acid download, acid inspect, and acid query / acid.open. The error now shows a per-root trail (which roots had no match, and which were unreachable, with the reason), a did you mean '<closest>'? suggestion drawn from already-cached/local listings (never a fresh crawl), and an actionable next step (acid search to find a downloadable catalog, acid download <name> to fetch one). For query / open, a name that's available to download is called out as such.
- acid download over SSH no longer shells out to rsync. The whole-file SSH download path now streams over the existing SshSession transport (the ssh subprocess, so ~/.ssh/config aliases / ProxyJump are honored, unlike paramiko/fsspec), the same backend the column-subset path already used. Three consequences: (1) the download progress bar fills smoothly within each file (byte-level, with a real ETA) instead of jumping one whole file at a time; (2) rsync is no longer required on the client for SSH downloads; (3) each transfer streams to a .tmp sidecar, verifies the full byte count landed, then atomically renames into place, so an interrupted transfer (Ctrl-C, dropped connection) never leaves a passing-but-truncated partition under its final name and a re-run cleanly re-fetches it (closing the rsync --partial + size-only-check hole that let a half-downloaded catalog look complete and fail later at metadata rebuild, decision #13). RsyncFetcher is removed; make_fetcher returns the new SshFetcher for ssh:// sources.
- SQL-string entry points moved under the acid.sql submodule (alpha rename, no shim). acid.sql(query) is now acid.sql.query(query); acid.validate / acid.explain are now acid.sql.validate / acid.sql.explain. acid.sql is a real module, not a callable: the top-level acid.sql(...) / acid.validate(...) / acid.explain(...) functions are gone. The fluent Catalog API and the Connection.sql / .validate / .explain methods are unchanged.
- Connection.map_partitions_sql is now private (_map_partitions_sql). The phase-1-only inspection hook is an internal power-user/debug tool, not part of the public surface; the top-level acid.map_partitions_sql delegate is removed with no replacement.
- save() rejects a single-file extension. save("out.csv") (any recognized flat-file extension) is now a ValidationError pointing at .export(...), instead of silently writing a HATS directory literally named out.csv. Pass a trailing slash (save("out.csv/")) to force a HATS tree genuinely named that. acid query --out x.csv --format hats (the explicit-conflict case) is the aligned CLI error.
- copartitioned= → localized= on group_by() (alpha rename, no shim): the assertion and the equi-join contract are the same spatial statement (rows sharing a key are localized to within the margin-cache radius) and now share one word.
- Equi joins assert locality and require the RHS to carry _healpix_29 and a declared margin cache (radius 0 = exact-pixel); position-less lookup tables use the in-memory broadcast join. Margin completeness validation (_band_budgets) now counts equi edges at their margin radius for deeper right-subtree leaves.
- point_map.fits (with real row counts) is required for every HATS catalog participating in a query under autosize; a missing map is a clear ValidationError naming the catalog.
- Plan validation (validate_ops) runs at compile time in both frontends (the lowering re-validates as a backstop), so structural/contract errors surface as ValidationError at composition, not worker-side ExecutionError.
- WorkTuple is now (n_cur, p_cur, leaf_scan) with the root folded in as slot 0: one LeafScan contract for every leaf.
- acid download text UI reworked to the CLI design language (cli_text_ui_design_language.html). The command renders a calm, scannable pipeline: a one-time ACID hello banner, one primary progress indicator at a time (an indeterminate ⠋ braille spinner, or a ▰▱ progress bar once a denominator is known), aligned ✓ step lines ([state] [object ~28 cols] [metric] [detail]), semantic color, and a final aligned Summary block (partitions / output size / elapsed / throughput). A partial download renders a ✗ reason+next-action error block before failing hard (downloads still never exit 0 with a half-built catalog). Pretty on a TTY; durable, parseable committed lines when piped (no animation); ACID_PROGRESS=off silences it. New flags: --quiet (only the summary), --verbose (per-item detail), and --progress {auto,on,off,plain} (mirrors acid query). The download bar advances by the fraction of the current file streamed, not one jump per completed file, so a large partition file fills smoothly (HTTP whole-file via a chunked read + Content-Length; SSH column subsets via per-chunk bulk_read progress); each file still contributes exactly one unit (retry-safe), so the k/N files readout stays exact. The HTTP column-subset path (pyarrow-driven Range reads, no byte hook) keeps one advance per completed file. Driven by a reusable PipelineReporter / Step in acid.io.progress, a sibling of the query path's RichReporter, with spec-exact glyphs (the two frontends deliberately don't share a look).
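The sidecar-then-rename pattern from the SSH download entry can be shown with the standard library alone. This is a sketch of the general technique, not SshFetcher's code: the function name, the chunk size, and the in-memory source stream are assumptions; only the ".tmp sidecar, verify byte count, atomic rename" sequence comes from the entry.

```python
import io
import os
import tempfile
from typing import BinaryIO


def fetch_atomically(src: BinaryIO, dest: str, expected_bytes: int,
                     chunk_size: int = 1 << 16) -> None:
    """Stream src to dest + '.tmp', verify the size, then rename into place."""
    tmp = dest + ".tmp"
    written = 0
    try:
        with open(tmp, "wb") as out:
            while True:
                block = src.read(chunk_size)
                if not block:
                    break
                out.write(block)
                written += len(block)
        if written != expected_bytes:
            raise IOError(f"short transfer: {written} of {expected_bytes} bytes")
        os.replace(tmp, dest)  # atomic within one filesystem
    except BaseException:
        # Covers Ctrl-C too: never leave a partial file behind.
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


with tempfile.TemporaryDirectory() as workdir:
    good = os.path.join(workdir, "part0.parquet")
    fetch_atomically(io.BytesIO(b"x" * 1000), good, expected_bytes=1000)
    complete_size = os.path.getsize(good)

    bad = os.path.join(workdir, "part1.parquet")
    try:
        fetch_atomically(io.BytesIO(b"x" * 400), bad, expected_bytes=1000)
        truncated_rejected = False
    except IOError:
        truncated_rejected = True

    leftovers = sorted(os.listdir(workdir))
```

Because the final name only ever appears through the rename, a later size or existence check on that name cannot be fooled by an interrupted run.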
[0.3.0a1] - 2026-06-10
Added
- Module-level API (singleton-by-default). A headline surface mirroring
Ray / DuckDB / Polars — no
with-block teardown per use:
import acid
acid.init("./data", workers=8) # optional — first acid.open() lazy-inits
df = acid.open("gaia").head(100).to_polars()
df2 = acid.sql("SELECT ... FROM a JOIN b ON XMATCH(...)").df()
acid.shutdown() # optional — atexit handles it
init() is fingerprint-matched (same config → no-op; a different config →
ConfigError unless reuse_existing=True). Module-level
open/sql/map_partitions_sql/add_catalog/register_moc/in_cone/
list_catalogs/validate/explain/status all delegate to one shared
default Connection (lazy-built on first use, torn down at exit).
acid.configure(progress=...) sets process-wide display defaults without
rebuilding the pool. acid.Connection(...) is the explicit-isolation escape
hatch (two simultaneous connections / two configs in one process); use it as
a context manager. (ACID-MODULE-API.md.)
- Catalog.join(<frame>, on=...): broadcast equi-join against an in-memory table. join's operand may now be a polars / pandas / numpy-structured / pyarrow / astropy frame, not just another Catalog, for attaching a flat id→value lookup whose key-matching rows aren't spatially co-partitioned:
labels = pl.DataFrame({"source_id": [...], "class": [...]})
db.open("gaia_dr3").where("phot_g_mean_mag < 18").join(labels, on="source_id")
The frame is spilled once to a broadcast (non-spatial) virtual catalog —
one memory-mapped Arrow IPC file, no coordinates / no _healpix_29 / no
partitions — and read whole into every work tuple, then Polars-hash-joined
on the integer key: partition-local, no Exchange, no reshuffle; INNER and LEFT
both supported; the result stays partitioned by the root. A broadcast table is
tuple-independent, so the engine short-circuits it to a whole-file scan
(_exec_source), bypassing the per-tuple scope machinery. A frame has no
position, so it's a .join() RHS only — .crossmatch(<frame>) errors (open it
with db.open(frame, ra=, dec=) for a spatial match). Deferred: string keys,
nested=True over a frame, and a db.sql/CLI surface for broadcast tables.
- acid query --open uses a raw file as a named table. A raw data file (.parquet / .csv / .fits / .arrow / …) can now be referenced by name in a CLI query while every other table still resolves from --db / acid_path: it is spilled once to a virtual catalog and registered. Two forms, both requiring the ra/dec column names (never guessed): positional PATH,RA,DEC (table name = file basename) or named NAME=PATH,ra=RA,dec=DEC. Repeatable:
acid query "SELECT t.id, g.source_id FROM t JOIN gaia_dr3 ON XMATCH(radius_arcsec => 1.0)" \
--db /data/hats --open t=candidates.csv,ra=RA,dec=DEC
The Python counterpart is Connection.register_file(name, path, ra=, dec=) —
the registering sibling of db.open(<file>) (which returns a fluent Catalog
but does not register a name, so the SQL escape hatch can't see it).
- db.open(...) opens raw files and in-memory frames as catalogs. Beyond a HATS directory, open() now accepts a raw data file (.parquet / .csv / .tsv / .arrow / .feather / .fits / .votable) or an in-memory frame (numpy structured ndarray / pandas / polars / pyarrow / astropy Table). This is the "bring my own RA/Dec target list" on-ramp, with no offline HATS import:
db.open("targets.parquet", ra="ra", dec="dec").crossmatch(db.open("gaia_dr3"), radius=1*u.arcsec)
db.open(my_ndarray, ra="RA", dec="DEC") # …pandas / polars / astropy too
The source is spilled once at open() to a single memory-mapped,
uncompressed Arrow IPC file under the Connection's scratch dir (cleaned up on
close()): ra=/dec= name the coordinate columns and are required (no
column-name guessing); _healpix_29 computed; NULL/NaN-coord rows dropped with
a warning; the data adaptively partitioned (budget-first, area-first) so a
sparse target list doesn't over-read the crossmatch RHS. To the rest of the
pipeline it is an ordinary (if coarse) HATS catalog — a virtual catalog,
distinguished by TableSpec.backing ("hats" vs "ipc"); a virtual RHS's
coverage reuses the general central ∪ ring mechanism with the ring computed
on the fly. Usable as crossmatch root or operand, INNER or LEFT. Deferred (see
docs/archive/EXTERNAL-SOURCES.md §6): non-spatial id↔id broadcast, a size
cost-guard.
-
Virtual catalogs read per-partition, not whole-file (≈12–90× on the read path). The virtual backing file is now written one Arrow IPC record batch per partition, sorted by
_healpix_29, and the engine reads only the batch(es) a cursor / margin band overlaps (TableSpec.batches_for_ranges+pyarrow.ipc.get_batchover themmap) instead of scanning the whole file and filtering per partition (Arrow IPC has no min/max pushdown). Because the exact_healpix_29predicate still runs over whatever is read, batch selection is a pure, over-approximating read-pruning — it cannot change results. Measured (synthetic, 2048 partitions): the whole-file path growsO(P×|file|)(13.6 s → 34 s → 85 s at 100k → 1M → 4M rows) while the per-batch path stays ≈1 s — 11× / 29× / 92×. Gated byACID_VIRTUAL_BATCH_SLICING(default on; set0/false/no/offfor the whole-file path, to A/B the speedup on real catalogs). (docs/archive/EXTERNAL-SOURCES.md§2.2.) -
Single-aggregate reduction shortcuts on
Catalog—count/sum/mean/min/max/std/var. Each is sugar for a one-aggregate.aggregate(...)and is polymorphic in the precedinggroup_by: global (no group-by) materializes and returns a bare Python scalar (db.open("a") .where("mag<18").count()→int), while grouped returns the lazy chainable Catalog with the stat in acount/mean_<col>/ … column (so a following.where(...)is HAVING). No engine changes — it rides the existing global/grouped aggregate path.count()isCOUNT(*),count(col)the non-null count; the rest take one column. (FLUENT-REFERENCE.md§5.2;CLAUDE.md"What acid is".) -
Composable (bushy) crossmatches & joins. A
crossmatch/joinoperand may now itself be a full join/crossmatch sub-spine, not just a single catalog —a.crossmatch(b.join(c, on="objectId"), radius=1*u.arcsec), or the deeper-leaf association shapegaia.join(ztf.join(ztf2gaia, on="ztf_id"), on="gaia_id")(the outer equi key may bind any leaf of the operand). It stays one per-partition program — no shuffle — because partitioning is compositional: every sub-expression is HEALPix-partitioned by its leftmost leaf (anchor_source(join.right)). The fluent compiler is reentrant (_compile_spine+ a_SpineCtxthreading slot ids / aliases / regions); the per-leaf scan contract is slot-keyed (WorkTuple.leaf_scan, oneright_rowid(slot)for spatial and equi RHS leaves — the positionalxmatch-index machinery is gone); enumeration gathers every leaf. Margin completeness (§7): a leaf whose declared margin-cache radius is below the sum of the spatial match radii on its path to root is a hard error invalidate_ops— LEFT and INNER alike (a LEFT-path shortfall fabricates false "no counterpart" rows); this generalizes the single-join margin assumption (it fires on a flat crossmatch whose radius exceeds the right's margin too). v1 covers spatially-bounded RHS sub-spines; the broadcast path for non-spatial id↔id mappings is deferred. CLAUDE.md key decision #17; the design study isdocs/archive/COMPOSABLE-JOINS.md. -
FLUENT-REFERENCE.md— the canonical definition of the fluent language. A root, authoritative, present-tense reference for the "acid fluent language": concepts + the abstractOpalgebra + a formal grammar (verb chain + the embedded-SQL fragment) + the full verb reference + composition/ordering semantics + an exhaustive limitations table + a worked gallery. Aimed at both a third-party re-implementor and an agent writing fluent queries. Added toCLAUDE.md's authoritative-docs list;docs/archive/CATALOG-API.mdis now provenance. -
Feature A — Python partition functions (UDFs).
Catalog.with_columns(name, fn, *, columns=, schema=, mode="numpy")(single- or multi-column) andCatalog.map_partitions(fn, *, schema=, columns=None)(whole-table) run a user callable per partition as aMapoperator in theOpPlan, so NumPy / SciPy / Astropy work that doesn't fit a Polars/SQL expression (calibrations, period fits, SED matching, custom stats) is a first-class fluent step with the engine's parallelism and partition layout.columns=andschema=are required (no signature inference) —columns=drives projection pushdown across the UDF boundary (a function reading 3 of 100 columns reads only 3 from parquet, lazymap_batches, verified via.explain()), and the declaredschema=makes output names + dtypes known at compile time so.columns/ downstream verbs compose with zero engine I/O. NumPy is the default input mode (one.to_numpy()at the worker boundary);mode="polars"passespl.Seriesthrough (forpl.col(...).list.*on nested lists).map_partitionsis the table form (collect barrier; receives the wholepl.DataFrame); it is rejected before a latercrossmatch/join(it can rewritera/dec/_healpix_29).@acid.function(api/function.py) declares a reusable function's metadata (acid_columns/acid_schema/acid_mode) once at the definition site, and on a class turns it into a deferred-construction (_Factory/_Deferred) stateful UDF: the heavy__init__runs once per worker, the instance lives in a bounded per-process cache keyed by(cls, args)and never rides the pickle (so a 100 MB model can't leak into the payload).cloudpickleis a new hard dependency and is forkserver-preloaded, so closures / lambdas / decorated classes ship to workers; theOpPlancarries the cloudpickledUserFnSpecbytes once per query, never per task. Spec/plan archived atdocs/archive/FLUENT-EXTENSIONS-IMPL-PLAN.md; as-built inCLAUDE.md/ARCHITECTURE.md. (Thedocs/cookbook.mdrecipe prose is the one deferred remainder — seeFLUENT-FUTURE-EXTENSIONS.md.) -
Feature B — nested aggregation (nested crossmatch +
db.sql LIST). Anested=True(+order_by=) kwarg onCatalog.crossmatchfolds each anchor object's matches into per-objectlist<T>columns — one row per anchor, the anchor's own columns staying scalar — instead of duplicating the wide LHS for every match (the canonical DP1-style lightcurve shape, which otherwise OOMs on the output shape, not the join work). It compiles to a partition-local LISTAggregategrouped on the synthetic anchor rowid (Path A), so it runs phase-1-only with no phase-2 reduce — a hard guarantee (needs_global_reduce is False, asserted, never hand-set), the difference between a ~1-hour and a ~10-hour full-sky save.order_by=co-sorts every list column by one key (tie-stablemaintain_order=True), so element i is the same match across columns.how="left"unmatched anchors get a one-element[null]list (not[]). The general cross-partition form isdb.sql:LIST(...)/ARRAY_AGG(...) ... GROUP BY <col>is recognized by the analyzer and decomposed through the sharedplan.aggregatesrecipe withneeds_global_reduce = True(Path B) — phase-1 partial implode, phase-2 concat (or, with an in-aggregateORDER BY, a key-carrying ordered merge). Both arms ride a typedAggExpr(NativeAggExprfor LIST,ScalarExprfor the SQL aggregates). The single-table fluent siblings (agg.list,collect_lists) and the nested equi-join (join(nested=True)) ship under their own entries below. Spec/plan archived atdocs/archive/FLUENT-EXTENSIONS-IMPL-PLAN.md; as-built inCLAUDE.md/ARCHITECTURE.md. -
Features A × B compose. A .with_columns / .map_partitions / .where after a crossmatch(nested=True) runs on the nested per-object list columns (the list-in → scalar-out lightcurve shape: numpy mode → an object-array of per-object sub-arrays, polars mode → a pl.Series of List with .list.*), staying partition-local (the Map runs in phase 1, no reduce). The schema fold derives the post-aggregate Map's output columns zero-I/O. A list-output dtype and a direct post-nest .select() narrowing remain deferred (FLUENT-FUTURE-EXTENSIONS.md).
-
Fluent Catalog.collect_lists(*cols, order_by=…, descending=…). A terminal fold verb after group_by: the single-table convenience sugar over agg.list, sibling of the nested join's all-RHS-columns default. It folds every column (or a named subset) except the group key(s) and the HEALPix index column into per-group list<T> columns named after their source column, so the headline "group diaSource by diaObjectId, collect every column into per-object lists" light-curve shape no longer makes you enumerate agg.list(...) by hand. Naming the columns narrows (only those agg.lists are built, so projection pushdown reads only them + the key + order_by); omitting them folds all the rest. Pure frontend sugar in api/catalog.py: it desugars to an AggregateStep of one agg.list(col, order_by=…) per column and flows through the existing _compile_aggregation / list_aggspec path, so there are no _fluent / _optree / engine changes. It inherits the cross-partition default and the group_by(copartitioned=True) partition-local form (and that path's agg.list-only / plain-key / no-having-sort-limit restrictions). order_by sorts within every list consistently (via the kwarg or a trailing "<col> DESC" suffix; the two compose by OR). Single-catalog only for now (a preceding crossmatch/join is rejected; the merged-frame fold is a follow-up); duplicate fold columns raise an actionable ValidationError. Spec archived at docs/archive/COLLECT-LISTS.md; as-built in CLAUDE.md / ARCHITECTURE.md.
-
Fluent agg.list + co-partitioned (partition-local) single-table list fold. agg.list("col", order_by=…, descending=…) is the fluent LIST/ARRAY_AGG constructor, so single-table list folding no longer requires dropping to db.sql. It is cross-partition by default (one row per key, full list, correct for any partitioning, equivalent to db.sql LIST(...) GROUP BY). group_by(*keys, copartitioned=True) is the opt-in partition-local form: it asserts the keys are co-partitioned (every row sharing a key lives in one HEALPix partition, the HATS nested-association layout) so the LIST fold runs phase-1 only with no cross-partition combine, the single-table sibling of the nested equi-join. copartitioned is a derivation input to the Aggregate's partition_local (threaded through _optree.terminal_cluster, re-derived by the engine.lower runtime guard which now reads the new Aggregate.copartitioned field, constant False in the db.sql analyzer), so the "always derived, never hand-set" invariant (§II.5.1) is preserved. A wrong copartitioned assertion makes a key spanning N partitions appear in N rows with split lists, which is why cross-partition is the default and the flag is off unless asked. Currently copartitioned requires every aggregate to be agg.list and every group key to be a plain column, and forbids .having() / .sort() / .limit() (each needs a cross-partition combine); per-partition decomposable aggregates, expression keys, and a catalog-level co-partition declaration are follow-ups. Spec archived at docs/archive/FLUENT-LIST-AGGREGATE.md; as-built in CLAUDE.md / ARCHITECTURE.md.
-
Nested equi-join: Catalog.join(..., nested=True, order_by=…). The object ⋈ source ON objectId light-curve shape now folds each left row's equi-join partners into per-row list<T> columns (one row per object; the left row's own columns stay scalar), matching the nested crossmatch (.crossmatch(nested=True)) that landed in M1.3. order_by= sorts the elements within each list, consistently across every list column. Frontend-only: three fields on OrdinaryJoinIR, two join kwargs, and one detection line in _compile_components; _fluent._compile_nested was already join-kind-agnostic, so there are zero engine changes. The aggregation is partition-local (phase-1 only), grouped on __anchor_rowid, so the right catalog must be co-partitioned with the left by the left object's HEALPix pixel (the HATS nested-association layout); a non-co-partitioned association silently under-fills lists exactly as the flat .join() drops rows. This is a documented precondition, not a verified one. The cross-partition variant is a follow-up. Spec archived at docs/archive/NESTED-EQUI-JOIN.md; as-built in CLAUDE.md / ARCHITECTURE.md §3.
-
acid download <name> resolves a bare catalog name. A single bare name (no /, not a URL) now resolves its source against a new download search path (ACID_DOWNLOAD_PATH → the download_path config key → the built-in https://data.lsdb.io/hats/) and, when no destination is given, its destination against ACID_PATH (the same search path acid query uses). Source resolution is collection-aware: for each <root>/<name>, a directory holding collection.properties is treated as a collection and its hats_primary_table_url child is downloaded (so acid download two_mass → https://data.lsdb.io/hats/two_mass/two_mass). The destination follows the same bare-vs-path rule as the source: omitted → <first local writable ACID_PATH root>/<catalog name> (URL entries skipped, auto-created with a notice); a bare token → <ACID_PATH root>/<token>; a path with a / (./x, /data/x) → verbatim. An explicit source path/URL is used verbatim and still requires an explicit destination. download_path is a first-class config key (acid config set/get/unset/show download_path) and resolves through the usual explicit → env → config → built-in precedence.
-
acid inspect <name> resolves a bare catalog name against ACID_PATH (the same local search path acid query uses) → config path → ~/datasets, collection-aware (a collection.properties directory resolves to its hats_primary_table_url child). So acid inspect two_mass / acid inspect schema two_mass work on a downloaded catalog without typing the full path. An explicit path or URL is used verbatim; a remote catalog needs its full URL (the remote download mirror is not searched).
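The explicit → env → config → built-in precedence and the bare-versus-path rule from the two entries above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not acid's implementation: `resolve_setting`, `is_bare_name`, and the plain dict standing in for the config file are hypothetical; only `ACID_DOWNLOAD_PATH`, `download_path`, and the built-in URL come from the changelog.

```python
import os

# Built-in default named in the changelog entry above.
BUILTIN_DOWNLOAD_PATH = "https://data.lsdb.io/hats/"


def resolve_setting(explicit, env_name, config, config_key, builtin, environ=None):
    """Return the first value set along explicit -> env -> config -> built-in."""
    environ = os.environ if environ is None else environ
    if explicit is not None:
        return explicit
    if environ.get(env_name):
        return environ[env_name]
    if config.get(config_key) is not None:
        return config[config_key]
    return builtin


def is_bare_name(token):
    """A bare name contains no '/' (a URL always contains one)."""
    return "/" not in token


# Hypothetical config-file contents and environment, for illustration only.
config = {"download_path": "/mirror/hats"}
env = {"ACID_DOWNLOAD_PATH": "ssh://host/hats"}
source_root = resolve_setting(
    None, "ACID_DOWNLOAD_PATH", config, "download_path", BUILTIN_DOWNLOAD_PATH, env
)
```

Here the environment variable wins because no explicit value was passed; removing it would fall through to the config value, and removing both would yield the built-in mirror.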
Changed
-
acid.connect() removed; use acid.init() (singleton) or acid.Connection() (explicit). Per the alpha no-backcompat policy there is no deprecation shim: acid.connect(...) is gone. acid.Connection(...) is the same object acid.connect() always returned (the explicit, context-manager Connection); acid.init(...) is the new module-level singleton entry. The CLI is unaffected (it builds its own Connection).
db.in_cone(...) is now an execution-time scope, not a construction-time capture. A Catalog no longer records the cone-block context at the moment it's built; the cone is read when the query is compiled/executed (which is what already happened for db.sql). So a query can be built once and run scoped inside an in_cone block and full-sky outside it, and a query built inside a block but executed outside it is full-sky (previously a StaleCatalogError):

    import astropy.units as u

    # db is an open acid Connection
    q = db.open("gaia").where("phot_g_mean_mag < 18")  # built anywhere
    with db.in_cone((180, 0), radius=2*u.deg):
        near = q.to_polars()    # scoped to the cone
    allsky = q.to_polars()      # full sky, same query object

Consequences: the cone is no longer part of Catalog identity (two identical queries built under different cones compare equal); Catalog._captured_cone_stack and the cone-prefix freshness check are gone. StaleCatalogError is removed (it only ever signalled the cone-context mismatch this change eliminates; a closed/GC'd Connection still raises ConnectionClosedError).
-
acid.agg constructors renamed to the astronomer/numpy idiom: agg.avg → agg.mean, agg.stddev → agg.std, agg.variance → agg.var (the other constructors are unchanged). The old SQL-spelled names are removed (alpha: no aliases). This is the Python surface only: the internal AggExpr.func decomposition token stays SQL-standard ("avg" / "stddev" / "variance"), since it is the key shared with the db.sql analyzer, so SQL queries still spell it AVG / STDDEV / VARIANCE. (FLUENT-REFERENCE.md §4.3.)
-
Nested joins execute by pre-imploding the RHS (no LHS-duplicated wide product). A nested=True crossmatch / join is now a first-class Join property (Join.nested), not a synthesized Aggregate above the join, and the engine collapses the RHS to per-anchor list<T> columns before the join, so the anchor (often a 1000+-column object row) is never replicated across its partners. At 2000 objects × 500 epochs × 300 columns the peak RSS drops ~20× (2469 MB → 122 MB) with identical output. Behavior change: a LEFT-unmatched anchor's lists are now an empty list [] (list.len() == 0), not a one-element [null]. (Engine: engine.lower._exec_nested_fold; as-built in ARCHITECTURE.md §5 / CLAUDE.md #14–#16; design in docs/archive/NESTED-PREIMPLODE.md + docs/archive/NESTED-JOIN-NODE.md.)
-
group_by(copartitioned=True) is now correct at partition boundaries. The single-table co-partitioned LIST fold scans the catalog's margin band (so a key's boundary-spilled rows land in the partition) and deduplicates with a local min(_healpix_29) ownership filter: each group is emitted by the one partition whose cursor pixel owns its minimum index, with no cross-partition communication, still phase-1-final. A boundary-straddling object now folds into one complete row (matching the cross-partition default) instead of splitting. Correct under the precondition object extent ≤ margin radius. It now requires a spatial index (_healpix_29) and a configured margin cache; each missing piece is a compile-time ValidationError (no silent boundary-wrong fallback). The cross-partition default is unchanged and needs neither. (Design in docs/archive/COPARTITIONED-MARGIN.md; as-built in ARCHITECTURE.md §5 / FLUENT-REFERENCE.md §6.5.)
-
Symmetric fluent→IR→Polars redesign (the IR mirrors the fluent composition; both join operands are lowered the same way). The cKDTree spatial match is now treated as the only operation Polars can't do natively, so a Join's left and right operands are real Op subtrees lowered through the same engine/lower._exec_subtree. This dissolves four asymmetries that were artifacts of the old left/right split:
- No more inline_where on the Op tree. A pre-filter on either operand is just a Filter node in that subtree, applied before the matcher, so nearest-surviving-match and LEFT-NULL semantics fall out of node ordering rather than a special right-keys filter slot. (db.sql's Relation.inline_where is now a value-type intermediate that _optree._leaf lowers to a Filter.)
- The fluent compiler is a fully-direct _steps → Op walk (frontend/_fluent.compile_catalog_ops): no relational value types, no components dict, no shared assembler. _optree (assemble_opplan / build_spine / terminal_cluster) is now the db.sql analyzer's path only; assemble_fluent_opplan / _build_interleaved_spine / _fold_above_nested_joins and the transitional _fluent_direct.py + equivalence test are deleted.
- Nested joins compose. a.crossmatch(b, nested=True).crossmatch(c, nested=True) (and the equi-join equivalent) now work: each nested=True join folds inline as a partition-local Aggregate above its Join, grouping on the current row's full identity (all live source rowids except the folded one), listing the folded RHS columns and carrying the rest via a first / carry aggregate. The "nested must be the first join" and "exactly one nested join" guards + the special nested dispatch are gone.
- The equi-join RHS keeps its scan rowid (equi_right_rowid) and its _healpix_29 just like a spatial-join RHS: "RHS in a join" and "RHS in a crossmatch" now carry the same surviving identity (which is what lets nested joins compose). Pruning the RHS _healpix_29 was a bug; it is a normal catalog column now, kept and collision-named like any other.
- db.sql collision naming is left-bare / right-prefixed (HATS-clean). In a db.sql join, the left side keeps its natural names and only a colliding right column is dotted (id, b.id); the previous scheme prefixed both sides. Output is clean for HATS round-tripping.
- Operand .select() is honored (not rejected). a.crossmatch(b.select(…)) projects the operand subtree via a plain Project node at its top: no special right_select field, no hidden coord retention. The operand is projected before the matcher/key-join, so the select must keep the coords / join key; dropping one is a clear compile-time ValidationError rather than a silent miss. Bare/renamed columns only; computed operand projections are deferred.
-
Verbs after .aggregate() compose in written order; .having() is removed. The aggregate is a barrier, and select / where / with_columns / sort / limit after it form the post-aggregate chain (_fold_post_aggregate → Project / Filter / Map / TopK / Limit), composing by position over the aggregate output. A post-aggregate .where() is the exact equivalent of the old .having() (deleted), and ordering is honored: .sort().limit(5).where(…) filters the top-5 (fluent ⊋ SQL). Engine: reduce.reduce_ops = _combine_aggregate (group_by/agg + the db.sql HAVING field + rename) + _lower_post_aggregate (lower the chain over the combined frame); phase-1 _lower passes through everything above a combining (cross-partition) aggregate. db.sql keeps SQL HAVING via the retained Aggregate.having field (analyzer/optree untouched). Partition-requiring verbs after an aggregate (crossmatch / join / in_region) stay rejected: the reduced result isn't HEALPix-partitioned and the aggregate consumes the coordinates (issue #101). Spec archived at docs/archive/FLUENT-IR-REDESIGN.md; as-built in CLAUDE.md (decisions #14–#16) / ARCHITECTURE.md §3–§6.
-
Fluent composition is now one ordered step list (the fluent tree builder). A Catalog's entire composition is a single ordered _steps tuple (api/catalog.py); every verb appends exactly one Step. The role-segregated slots (_joins / _pre_where / _post_steps / _select / _limit / _group_keys / _aggs / _having / _order / _regions) are deleted; _fold_steps folds _steps into slot-equivalents and _fluent._StepView wraps a Catalog so the compile tail runs unchanged. Verb placement is now structural: the pre/post-where barrier (first join or Map) and the last-of-kind terminal verbs are decided in _fold_steps, not at verb time. __eq__ / __hash__ / __repr__ / describe fold _steps, so identity can't drift the way parallel slots could; _columns_override is now part of identity (a db.open(..., columns=…) subset reports distinct .columns, so it must compare/hash distinctly; a latent collision bug, fixed). The post_steps_post_terminal flag is gone in favor of two explicit ordered chains in assemble_fluent_opplan (post_steps on the spine, post_terminal_steps + post_terminal_projection above the terminal). db.sql (parser/analyzer) is byte-untouched. The design is archived at docs/archive/FLUENT-TREE-BUILDER-DESIGN.md; the as-built is CLAUDE.md (decision #14) / ARCHITECTURE.md §3.
- Post-Map nested .select() unlock. A .select() after a .with_columns() / .where() on a crossmatch(nested=True) now projects the post-Map columns: it becomes a Project above the post-terminal Map chain (Project(Map(Project(Aggregate)))). A .select() directly after the nested crossmatch still prunes which RHS columns are listed.
-
validate_ops is the structural backstop for Map / Filter placement. A Map or Filter inside a Join's subtree (the operand-subtree / between-joins shape the builder can now represent but the engine cannot yet execute) is rejected there; the thin verb-time guards (_reject_if_mapped / _reject_mapped_operand) supply the friendly, verb-keyed message. The schema fold, engine.lower._spine, and connection._explain_plan walk through such a node so a builder-produced-but-rejected tree folds and explains cleanly.
-
ParquetSink flushes on bytes, not rows. The buffer threshold is now byte_target (default 256 MiB), replacing the previous row_group_target (1M rows) entirely. The row-count threshold left the in-memory buffer free to balloon on wide schemas: 1M rows × long string / struct / list columns (e.g. the Rubin DP1 object catalog at ~1250 columns) can be many GB before the flush fires, turning the buffer into the OOM vector the flush was meant to prevent. pa.Table.nbytes (a C-level sum of Arrow buffer sizes) is the cheap estimate we accumulate against byte_target. Per-flush row count is now variable; the row_group_size pin still ensures one Parquet row group per flush. Breaking config change (alpha; no shim): the row_group_target= kwarg is gone. Fixes #33.
- acid never auto-creates a user-supplied tmpdir. A non-existent path passed to acid query --tmpdir <path> or to acid.connect(tmpdir=<path>) now fails fast (with AcidError from the CLI / ConfigError from the library) instead of being silently materialized via mkdir(parents=True, exist_ok=True). The hazard the auto-create masked was real on shared systems: a mistyped path or one whose intended mount isn't present landed scratch data on the wrong volume (often under $HOME), with no warning until the disk filled up. acid download already had this fail-fast discipline; the CLI and the Python API now match. If you need the auto-create, do it explicitly (mkdir -p <path> / os.makedirs(...)) before calling acid. Fixes #72.
-
The CLI moved to its own top-level package, and import acid no longer tunes the host. The CLI source (acid.cli) is now the new top-level acid_cli package; the console script is acid = "acid_cli:main" and python -m acid_cli also works. acid_cli owns its process and stages worker tuning (jemalloc _RJEM_MALLOC_CONF + the BLAS/OMP family) before importing the heavy stack. The acid library itself sets no allocator/thread env at import, so import acid in Jupyter (or any host app) no longer mutates that interpreter's allocator or thread pools, the silent side effect that motivated this work. acid/__init__.py is PEP 562 lazy (heavy names resolve on first attribute access); acid --help / --version / acid config complete without loading numpy / polars / pyarrow. Worker tuning is staged on-demand via a numpy-free preload shim (acid.engine._worker_env) listed first in set_forkserver_preload, fed by inert ACID_WORKERS_* env vars the parent exports through the new mandatory start_pool_with_env helper both pool builders route through. The forkserver preload list collapses to two entries (the shim + the acid.engine._preload fat-import anchor); adding a new heavy worker-side dependency means adding an import to _preload.py (a drift test polices this). ACID_CAP_BLAS is dropped (OMP is now always managed in acid-owned processes); the ACID_FORKSERVER_PRELOAD=0 opt-out is dropped (preload is mandatory, being the one airtight worker-tuning hook). User-facing knob is now workers_jemalloc_conf / ACID_WORKERS_JEMALLOC_CONF on the standard explicit → env → config → built-in chain. One user-visible behaviour change: an imported acid.connect(workers=1) runs the in-process per-partition collect with the host's stock jemalloc (the change deliberately leaves a host untouched; the cross-process madvise contention the tuning kills is largest at high worker counts and negligible at workers=1). Spec archived at docs/archive/WORKER-ENV-TUNING.md; user-facing guidance in MEMORY-TUNING.md.
-
Operator-tree engine: the flat Plan IR is gone. The one executed plan shape is now the left-deep operator tree (OpPlan, plan/ops.py); both frontends emit it directly (analyzer.analyze_ops, _fluent.compile_catalog_ops, via the shared _optree.assemble_opplan), the engine lowers it to one Polars LazyFrame per work-tuple (engine/lower.py), and the phase-2 reduce reads off its root nodes (reduce.reduce_ops). The flat Plan dataclass and its lowering path (analyzer.analyze, _fluent.compile_catalog, Catalog._compile_plan, reduce.reduce, executor.phase1_agg / aggregate_output_columns, schema.merged_schema / output_columns / column_origin_map, the lower_legacy_plan_to_ops bridge) are deleted; acid.plan.Plan is now a kept alias of OpPlan. Connection.validate(query) returns an OpPlan. The engine refactor ADR is archived at docs/archive/ARCHITECTURE-ENGINE.md; the as-built model is ARCHITECTURE.md §4–§6.
- Memory-first phase-2 reduce. Global-reduce queries (decomposable aggregates / top-K) now combine their per-partition partials in memory instead of always round-tripping through a Parquet tempdir, a latency win on the interactive COUNT(*) / GROUP BY path. The disk reduce remains the automatic fallback when partials spill past inmem_row_limit, so billion-row behavior is unchanged. Applies to .df(), the CLI display and single-file --out, and Catalog.save. See docs/archive/REDUCE-INMEMORY.md.
- acid query with no --out streams the full result. The implicit 100-row display cap is gone: output is a type-driven fixed-width table on a TTY (columns trimmed to terminal width) and TSV when piped/redirected, emitting every row. Fixes silent truncation of piped output (acid query … | wc -l previously returned 100, not the true count). A user-written LIMIT N is still honored.
- db.sql(query, output=<path>) for a global reduce is now rejected (ValidationError) instead of writing phase-1 partials into the directory (which was neither the answer nor a valid catalog). Omit output= for the result, or use Catalog.save for single-partition HATS.
- Empty catalogs are rejected at registration. A HATS catalog path that exists but enumerates zero partitions now raises RegistryError at connect() time, instead of silently yielding an empty query result; a valid HATS catalog must have at least one partition. A non-existent path is still tolerated for offline config validation (with hpix_order set explicitly). This removes the now-dead empty-catalog special-casing in reduce_global.
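The memory-first reduce with its disk fallback, described in the list above, follows a simple shape: combine partials in memory when they are small, otherwise round-trip through a temporary directory. The sketch below is an assumption-laden toy, not the engine's code: `reduce_counts` is a hypothetical name, JSON files stand in for Parquet partials, and "rows" is approximated by the number of keys per partial.

```python
import json
import os
import tempfile
from collections import Counter


def reduce_counts(partials, inmem_row_limit):
    """Combine per-partition {key: count} partials; spill to disk past the limit.

    Returns (path_taken, combined), where path_taken is "memory" or "disk".
    """
    partials = list(partials)
    total_rows = sum(len(p) for p in partials)
    if total_rows <= inmem_row_limit:
        combined = Counter()
        for part in partials:
            combined.update(part)  # adds counts key by key
        return "memory", dict(combined)

    # Fallback: write each partial out, then merge by reading them back.
    combined = Counter()
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i, part in enumerate(partials):
            path = os.path.join(tmp, f"partial_{i}.json")
            with open(path, "w") as fh:
                json.dump(part, fh)
            paths.append(path)
        for path in paths:
            with open(path) as fh:
                combined.update(json.load(fh))
    return "disk", dict(combined)


partials = [{"a": 2, "b": 1}, {"a": 3}]
fast = reduce_counts(partials, inmem_row_limit=10)
slow = reduce_counts(partials, inmem_row_limit=2)
```

Both calls produce the same combined counts; only the path taken differs, which is the property that keeps large-result behavior unchanged.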
Fixed
-
Virtual-catalog batch slicing now works under multiple workers. Per-partition IPC batch slicing maps a cursor to batch indices via TableSpec.partitions, but the plan shipped to workers strips partitions (parent-only enumeration bulk), so with workers > 1 a virtual root found no batches and read 0 rows (a workers=1 run, which doesn't strip, hid it). _strip_for_workers now keeps partitions for a virtual spec (bounded by max_partitions, cheap to ship), while still stripping a real HATS catalog's potentially-huge list. Regression test runs with workers=2.
-
Virtual-catalog rows are read once, not once per partition. A virtual catalog (db.open(<file | frame>)) is backed by a single file that holds the whole catalog, so each partition's root scan must be filtered to its cursor pixel (unlike a real HATS partition, whose file is that one partition). _root_filters only applied that filter when refining below the root partition, so a virtual root read all rows in every partition: db.open(f).count() / .to_polars() returned N×P (rows duplicated per partition), and a virtual-root crossmatch did O(P×N) wasted work. Fixed by always applying the cursor-pixel filter to a virtual root. (The crossmatch oracle tests missed it because they compare sets of pairs, which dedup the duplicates; a non-deduping count() / row-count regression test now guards it.) Found while benchmarking the per-partition read path.
-
Crossmatches no longer miss matches at the outer edge of an RHS footprint. A root point just outside the right catalog's footprint, within the match radius of a right point just inside (across a partition seam into an empty pixel), was silently dropped (INNER) or returned a false NULL (LEFT). This was a real correctness gap wherever two catalogs' footprints differ (common with disjoint survey footprints). Root cause: the margin cache only bridged central↔central partition seams, and the enumeration coverage was central-only, so a cursor in an empty pixel had no coverage and no staged data. Fix, in two pre-staged halves (no query-time inter-tile communication): the margin builder now also emits a "ring" of margin files at the empty pixels just outside the footprint (uniform at the catalog's finest partition order), holding the bordering central rows; and the spatial-crossmatch enumeration coverage is now central ∪ ring (TableSpec.coverage_partitions), so a cursor in a ring pixel is provisioned with the ring's margin file and the matcher finds the across-edge partner. Margin builds gain a perimeter-bounded set of small ring files; query enumeration gains only the boundary-shell work tuples that were previously skipped (scales with the populated footprint perimeter, not area). Equi joins are unchanged (co-partitioned / key-based, no ring). Validated: a new synthetic edge test (INNER + LEFT) plus the bench/validation real-data suite at exact equality with ring-inclusive margins. Existing margin caches should be rebuilt (acid hats build-margin) to pick up the ring; a ring-less cache simply keeps the old central-only coverage (no regression, no edge fix).
-
FixedWidthSink accepts an explicit width= constructor kwarg. The TTY display path used when --out is omitted consulted shutil.get_terminal_size((80, 24)) unconditionally, so layout depended on the calling terminal: useful for the actual TTY display but a hazard for tests and library callers that wanted a deterministic layout. The new width= (default None) pins an explicit column-drop budget when given; None keeps the existing ambient-terminal behavior (which honours $COLUMNS first via shutil.get_terminal_size, so users can still override at runtime). Fixes #87. The test that hit the bug (test_fixed_width_sink_type_driven_null_and_nested) now uses width=120 and passes deterministically regardless of the calling terminal width.
-
Streaming sinks raise a clear OutputError on schema drift (instead of a confusing pyarrow stack trace mid-stream). Each of ParquetSink / CsvSink / FitsSink now compares every incoming partition table's schema against the first one it pinned the writer to; on drift it raises with a diff (added / removed columns, type changes) and explicitly says this is an acid bug, pointing at the issue tracker, because the engine is supposed to produce a consistent output schema across partitions for a given query plan (see plan/schema.py's schema fold + Polars's null-fill on LEFT outer joins). The original reproducer in #35 (LEFT XMATCH null-fill drift to null<>) no longer fires on current main, but the safety net catches any future regression with a useful message rather than a truncated file at --out. Closes #35.
-
Download: flat columns whose name contains a dot are no longer mistaken for struct leaves. acid.tools.download mapped a parquet physical column back to its top-level Arrow field by splitting path_in_schema on the dot character (.), but a flat column literally named a.foo and a struct a's leaf foo produce identical path_in_schema='a.foo', so the naive split collapsed them. Now _physical_col_indices walks the actual Arrow schema (counting leaves per top-level field via the new _physical_col_ranges helper) to keep the two cases distinct. This matters because the query-lowering redesign lets db.sql emit dotted output column names on collision (SELECT a.foo, b.foo → a.foo, b.foo), so flat dotted names can now appear in HATS trees acid itself writes. The CLAUDE.md "Parquet physical vs logical column indices" gotcha was updated; it had been endorsing the buggy split recipe. Fixes #45.
-
Scratch HEALPix-count mmaps no longer land in the output directory. Both point_map.fits accumulators are pure scratch (mkstemp'd, unlinked after use) but were created inside the output tree, where a full read-back of the ~805 MB order-12 map dominates on a slow/networked store (~30 s on /sdf NFS vs ~1 s on local scratch). They now go to a scratch dir (None ⇒ system temp, honours $TMPDIR): the engine skymap via execute(..., scratch_dir=), where the CLI HATS path passes --tmpdir and the Connection passes its owned scratch dir (#68), and the download-time catalog footprint via download_catalog(..., tmpdir=), exposed as a new acid download --tmpdir flag (#71). The download path's accumulator is (workers, npix), so the win there scales with worker count. A download --tmpdir that doesn't exist now errors before any bytes are fetched (it is never auto-created), so a mistyped path fails loudly rather than leaving a half-built catalog.
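The scratch-file discipline in the last entry (a scratch dir that defaults to the system temp location, is never auto-created, and is cleaned up after use) maps onto standard-library calls. The following is a minimal sketch under the assumption that a helper like the hypothetical `open_scratch_file` exists; it is not the project's code.

```python
import os
import tempfile


def open_scratch_file(scratch_dir=None, suffix=".fits"):
    """Create a scratch file; None means the system temp dir (tempfile honours $TMPDIR)."""
    if scratch_dir is not None and not os.path.isdir(scratch_dir):
        # Fail fast instead of creating the directory on the caller's behalf.
        raise FileNotFoundError(f"scratch dir does not exist: {scratch_dir}")
    fd, path = tempfile.mkstemp(suffix=suffix, dir=scratch_dir)
    return fd, path


fd, path = open_scratch_file()
try:
    os.write(fd, b"\0" * 16)   # stand-in for the accumulator's contents
finally:
    os.close(fd)
    os.unlink(path)            # pure scratch: removed after use
```

Passing a directory that does not exist raises before any file is created, which mirrors the fail-fast behavior described for `--tmpdir`.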
Deviation from the spec
- Operand .select() is rejected, not honored. The fluent tree builder design (FLUENT-TREE-BUILDER §1.6/§7) listed an operand's .select() (a.crossmatch(b.select(...))) as a "honored for free" unlock. Honoring it requires projecting the operand subtree before the merge, which means changing collision suffixing in op_merged_schema and the matcher's right frame in _exec_merged, i.e. the high-risk operand-subtree execution the design defers. It is now rejected loudly (_reject_select_operand), turning the prior silent drop into a clear error; honoring it folds into the deferred operand-subtree-execution follow-up (CLAUDE.md "Things explicitly NOT done"; FLUENT-FUTURE-EXTENSIONS.md §2.4).
[0.2.0a3] — 2026-06-03
Worker-startup performance, a dependency-light MOC implementation, a
unified progress UI, and a self-configuring acid.conf settings layer.
Includes public Python API and CLI changes (see Changed).
Added
- Memory-aware workers="auto": the automatic worker count is now min(cpu_cap, mem_cap, 24). In addition to the existing CPU cap (affinity / cgroup quota / cpu_count), a memory cap gives each worker at least mem_per_worker_gb of RAM (default 4 GB), so a high-core but memory-modest box (or a tight cgroup memory limit) no longer spins up one worker per core and OOMs. The auto count is also capped at 24 (parallel efficiency flattens before the core count on big nodes); the cap is auto-only, so an explicit workers / ACID_WORKERS / config / --workers is never capped. The memory figure is min(physical RAM, cgroup memory limit); a small tolerance absorbs the kernel's RAM reservation. New mem_per_worker_gb config key / ACID_MEM_PER_WORKER_GB env / acid.connect(mem_per_worker_gb=) arg. cgroup CPU- and memory-limit detection now honors the mountinfo root mapping, so a limit on a container whose own cgroup is the mount root (typical Docker/k8s) is read at the mountpoint instead of a nonexistent nested path; previously such limits were silently missed.
- acid.conf settings layer (config.py; see docs/archive/CONFIG-SYSTEM.md): per-machine configuration of the catalog search path, query workers, the per-worker RAM budget mem_per_worker_gb, the tmpdir base, and inmem_row_limit, resolved explicit → env → config file → built-in. Discovery is first-found-wins over ~/.config/acid/acid.conf, two /sdf/.../etc paths, $XDG_CONFIG_DIRS, and /etc/acid/acid.conf; --config FILE / ACID_CONFIG point at a specific file. Env overrides: ACID_PATH, ACID_WORKERS, ACID_MEM_PER_WORKER_GB, ACID_TMPDIR, ACID_INMEM_ROW_LIMIT. New acid config show|get|set|unset subcommand (file values by default, --effective for resolved). acid query gains --tmpdir; the --help text for --db / --workers / --tmpdir shows the effective default + provenance. acid.connect(...) gains a config= argument.
- rich-based progress UI (io/progress.py + the engine-neutral engine/_reporter.py Reporter protocol). One self-overwriting status line on a TTY: an arrow3 spinner through the setup stages, switching to a full-width block bar with percent / row count / ETA during execution. Writes to stderr only, so piped stdout results stay clean. New acid query --progress {auto,on,off,plain} flag, fronting the ACID_PROGRESS env var (0 / 1 / plain); plain commits one line per stage for logs/debugging.
- acid query prints a final elapsed: <dur> runtime line (interactive only; silent when piped or --progress off). Anchored at process start, so it spans imports → parse → worker launch → execution → metadata write (≈ /usr/bin/time minus the unmeasurable interpreter launch + import acid).
- Startup banner for acid query: a one-time cyan boxed ACID logo (● ▪ A C I D, tagline, version, project URL) printed to stderr the moment the CLI is alive, with a compact fallback on narrow terminals. Gated like the progress UI (TTY / ACID_PROGRESS), so it's silent when piped or off. The box also carries a runtime/resource panel: worker count, threads per worker, and total memory, then the temp directory and (with --out) the output path, each with their filesystem's free disk space; long paths are tail-ellipsized to fit.
- hats/rangemoc.py: a minimal MOC (order-29 [lo, hi) ranges + vectorized set algebra) covering the mocpy.MOC slice acid uses, so mocpy is no longer needed at runtime. Validated bit-for-bit against mocpy (now a test-only oracle) in tests/test_rangemoc.py.
- Worker-startup tuning knobs, all on by default (opt out with =0): ACID_CAP_BLAS (cap native BLAS/OMP thread pools), ACID_FORKSERVER_PRELOAD (preload the native stack into the forkserver server for COW inheritance), ACID_PREWARM (spawn all workers up front). Documented in MEMORY-TUNING.md §Worker startup.
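The memory-aware worker count from the first entry in this list, min(cpu_cap, mem_cap, 24), is easy to check by hand. The sketch below assumes the memory cap is the available memory divided by mem_per_worker_gb, rounded down, and floors the result at one worker; `auto_workers` is a hypothetical name, and the kernel-reservation tolerance mentioned above is left out.

```python
def auto_workers(cpu_cap, mem_gb, mem_per_worker_gb=4.0, hard_cap=24):
    """Sketch of the automatic worker count: min(cpu_cap, mem_cap, hard_cap).

    mem_gb stands for min(physical RAM, cgroup memory limit) in GB.
    The floor of one worker is an assumption of this sketch.
    """
    mem_cap = int(mem_gb // mem_per_worker_gb)
    return max(1, min(cpu_cap, mem_cap, hard_cap))


many_cores_little_ram = auto_workers(cpu_cap=128, mem_gb=64)   # memory-bound
few_cores = auto_workers(cpu_cap=8, mem_gb=512)                # CPU-bound
big_node = auto_workers(cpu_cap=64, mem_gb=512)                # hits the cap of 24
```

A 128-core box with 64 GB gets 16 workers rather than 128, which is the OOM scenario the entry describes.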
Changed¶
- acid.connect(...) settings now resolve through acid.conf (alpha, no back-compat shims). cache_dir= is renamed tmpdir= and is now a base directory: a unique owned scratch subdir is created under it and removed on close() (previously an explicit cache_dir was used directly and never cleaned). workers / inmem_row_limit default to None (resolve from env/config/built-in) instead of "auto" / 50_000_000, and source=None now resolves (ACID_PATH → config path → ~/datasets) rather than yielding an empty connection; pass source=[] for an explicitly empty one.
- acid query / validate require the SQL query. Pass it as the positional argument, - to read stdin, or -f FILE; omitting it is now an error (it no longer silently reads stdin).
- acid query's built-in --workers default is now auto (cgroup-aware), not 1.
- Worker startup optimized for high partition counts. The plan shipped to workers is stripped of parent-only bulk (TableSpec.partitions / partition_index / margin_partitions, MocSpec.moc), ~2 MB → ~1 KB on a 100k-partition catalog; a single-catalog enumeration fast path skips the dense partition-ID map; workers capture parquet FileMetaData at write time so the master assembles dataset/_metadata by concatenation instead of re-reading every footer. Several-fold faster cold start at high worker/partition counts.
- point_map.fits accumulation is now a single lock-free shared mmap (was per-worker rows summed at the end), written sparsely (O(nonzero)). Work tuples own disjoint footprint pixels, so workers scatter-add without locks; a runtime row-count cross-check raises loudly if that invariant is ever violated.
- CLI progress is now TTY-auto: an animated bar on a stderr TTY, silent when piped. Previously the tqdm bar was effectively always shown.
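The explicit → env → config file → built-in precedence that acid.connect(...) settings now follow can be sketched as below. The resolve_setting helper, the two lookup tables, and the provenance strings are hypothetical illustrations; only the environment variable names and the ~/datasets and auto defaults come from the entries above, and values are returned without any parsing.

```python
import os

# Built-in fallbacks named in the changelog; the dict itself is illustrative.
BUILTIN_DEFAULTS = {"path": "~/datasets", "workers": "auto"}
# Environment variable consulted for each key.
ENV_VARS = {"path": "ACID_PATH", "workers": "ACID_WORKERS"}


def resolve_setting(key, explicit=None, env=None, config=None):
    """Return (value, provenance) using explicit -> env -> config file -> built-in."""
    env = os.environ if env is None else env
    config = {} if config is None else config
    if explicit is not None:
        return explicit, "explicit"
    env_name = ENV_VARS[key]
    if env.get(env_name):  # unset or empty falls through to the next layer
        return env[env_name], f"env:{env_name}"
    if key in config:
        return config[key], "config"
    return BUILTIN_DEFAULTS[key], "built-in"


print(resolve_setting("path", env={"ACID_PATH": "/data"}, config={"path": "/cfg"}))
print(resolve_setting("workers", env={}, config={}))
```

Returning the provenance alongside the value is one way to support output like the effective default + provenance shown in --help.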
Fixed¶
- CTRL-C during acid query now exits cleanly instead of flooding stderr with per-worker tracebacks. A terminal SIGINT reaches the whole foreground process group; pool workers now ignore it (so only the parent unwinds), and the parent forcibly reaps a one-shot pool (SIGTERM → brief grace → SIGKILL) before exiting with code 130. The reap ignores further SIGINTs so repeated CTRL-C can't orphan workers. A Connection-owned persistent pool is left untouched (its workers stay healthy and reinitialize on the next query).
- CTRL-C during worker-pool startup no longer leaks a traceback from a process the user never sees. Two startup-window leaks are closed: (1) the forkserver bootstrap preloads the native stack (numpy/pyarrow/polars/…) before it installs its own SIGINT: SIG_IGN, so a CTRL-C mid-preload printed a KeyboardInterrupt from deep inside an import; the forkserver is now started while the parent holds SIGINT: SIG_IGN (which survives exec, and a fresh Python honors an inherited SIG_IGN), so its preload runs uninterrupted; and (2) the ACID_PREWARM barrier's transient multiprocessing.Manager raced its own teardown on interrupt, printing a FileNotFoundError / BrokenPipeError; the prewarm now defers SIGINT across the Manager's lifecycle and re-raises it cleanly once the Manager is down.
- Misleading "unknown column" error for a table alias passed to an unrecognized function. A typo like WHERE xIN_MOC(d, 'object') previously raised a self-contradictory unknown column 'd' … known tables: d, o whose caret pointed at an unrelated d.* in the SELECT. The analyzer now diagnoses it as 'd' is a table alias, not a column, hints toward IN_MOC (the only acid function taking a table alias), and the error-span finder points the caret at the offending alias argument rather than the first textual match.
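The CTRL-C handling described in the first entry (workers ignore SIGINT, the parent reaps with SIGTERM → brief grace → SIGKILL, exit code 130) follows a general pattern that can be sketched with the standard library. This is an illustration under assumptions, not acid's code: ignore_sigint and reap are hypothetical names, and plain subprocesses stand in for pool workers.

```python
import signal
import subprocess
import sys

EXIT_SIGINT = 128 + int(signal.SIGINT)  # 130, the conventional code after CTRL-C


def ignore_sigint() -> None:
    """Worker initializer: ignore SIGINT so that only the parent unwinds."""
    signal.signal(signal.SIGINT, signal.SIG_IGN)


def reap(procs, grace: float = 0.5) -> None:
    """Stop children with SIGTERM, wait briefly, then SIGKILL any survivors."""
    # Ignore further CTRL-C while reaping so a repeated interrupt cannot orphan workers.
    previous = signal.signal(signal.SIGINT, signal.SIG_IGN)
    try:
        for p in procs:
            p.terminate()              # SIGTERM
        for p in procs:
            try:
                p.wait(timeout=grace)  # brief grace period
            except subprocess.TimeoutExpired:
                p.kill()               # SIGKILL
                p.wait()
    finally:
        signal.signal(signal.SIGINT, previous)


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    workers = [subprocess.Popen(sleeper) for _ in range(2)]
    try:
        raise KeyboardInterrupt  # stand-in for a real CTRL-C
    except KeyboardInterrupt:
        reap(workers)
        print("would exit with code", EXIT_SIGINT)
```

In a multiprocessing pool, a function like ignore_sigint would typically be passed as the pool's initializer.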
Removed¶
- mocpy runtime dependency: moved to a test-only extra (see hats/rangemoc.py).
- tqdm dependency: replaced by rich (the tools' progress bars now use the acid.io.progress.bar shim).
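As a rough illustration of the range representation that hats/rangemoc.py is described as using, the sketch below intersects two sorted lists of half-open [lo, hi) ranges. It is a plain-Python two-pointer walk rather than the vectorized implementation, and the function name is hypothetical.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open [lo, hi) interval of pixel indices


def intersect_ranges(a: List[Range], b: List[Range]) -> List[Range]:
    """Intersect two sorted, non-overlapping lists of [lo, hi) ranges."""
    out: List[Range] = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:  # the two current ranges overlap
            out.append((lo, hi))
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


print(intersect_ranges([(0, 10), (20, 30)], [(5, 25)]))
```

Because the ranges are half-open, two ranges that merely touch, such as [0, 5) and [5, 10), produce an empty intersection.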
[0.2.0a2] — 2026-06-01¶
A performance + tooling release on top of the native-Polars engine. No API or on-disk changes.
Fixed¶
- --workers 16 native-engine regression (the headline fix). A wide LEFT-XMATCH anti-join ran ~30% slower than the old --engine=polars build. Root cause was the bundled-jemalloc dirty-page purge (madvise(MADV_DONTNEED)) serializing across workers on the kernel mmap_lock, not acid code. acid now sets _RJEM_MALLOC_CONF=dirty_decay_ms:-1,muzzy_decay_ms:-1 by default (via setdefault, so it's overridable) before Polars is imported: ~2× faster wall at high worker counts, at ~20% higher peak RSS. Full analysis in bench/W16-EFCOLLECT-REGRESSION.md.
- O(C²) per-partition schema fold on wide catalogs. _relation_columns now indexes column_types positionally instead of the per-name TableSpec.column_type linear scan, which dominated worker CPU on a 1250-column light-curve catalog.
- coalesce=False on the XMATCH join made the LEFT join ~2× slower per partition on a wide right catalog. Now coalesce=True, which is invisible to output (the coalesced key is an engine-internal __ column).
- ACID_PROFILE=1 with multiple workers produced no output: forkserver workers don't run atexit hooks. Rewritten as a per-worker shared mmap (one row per worker, summed by the master), mirroring the skymap mechanism. (#50)
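The overridable allocator default from the headline fix amounts to a setdefault on the environment before Polars is first imported. A minimal sketch, with configure_allocator as a hypothetical name:

```python
import os

JEMALLOC_VAR = "_RJEM_MALLOC_CONF"
JEMALLOC_DEFAULT = "dirty_decay_ms:-1,muzzy_decay_ms:-1"


def configure_allocator(environ=os.environ) -> str:
    """Set the allocator knob only if the user has not already chosen a value."""
    # setdefault leaves an existing value alone, so the default stays overridable.
    return environ.setdefault(JEMALLOC_VAR, JEMALLOC_DEFAULT)


print(configure_allocator())  # must run before the first `import polars`
# import polars as pl          # left commented so the sketch has no third-party dependency
```

A user who prefers the lower-RSS behavior can export their own _RJEM_MALLOC_CONF value before starting Python, and the sketch will keep it.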
Added¶
- MEMORY-TUNING.md: user-facing guide to the jemalloc allocator knob and the related workers / threads memory levers; linked from the README and listed as an authoritative doc.
- engine/profiling.py: per-worker, per-step profiling module (anchor_setup / right_setup / xmatch / execute_final / write), with a stderr summary table and a JSON per-worker matrix (ACID_PROFILE_OUT). (#50)
[0.2.0a1] — 2026-05-31¶
Lands the fluent Catalog API (CATALOG-API.md). The old Session
/ acid.sql / acid.run surface is removed without aliases — 0.1.0a*
code won't run unchanged. See "Removed" for the migration sketch.
Polars-native engine (breaking)¶
acid now has one execution engine: native Polars. A query lowers
to a single engine-neutral Plan executed as polars.LazyFrame ops,
including the phase-2 reduce (ARCHITECTURE.md).
Removed (no aliases, alpha):
- The engine= keyword and --engine flag; Polars is implied.
- The duckdb_threads= keyword / --duckdb-threads flag, renamed to threads= / --threads.
- The DuckDB engine (acid.engines.duckdb), the Engine / PartitionContext ABCs, and acid.engine.resolve.
- The per-tuple SQL generator (acid.rewriter) and the SQL phase-2 reducer (acid.reducer).
- The QueryPlan IR container and the acid.planner adapter; acid.analyzer.analyze now emits acid.plan.Plan directly.
- DISTINCT / COUNT(DISTINCT) / bare GROUP BY / unbounded ORDER BY: the full-materialization fallback now raises ValidationError. Decomposable aggregates, top-K, SELECT *, and bare LIMIT are unchanged.
- duckdb is no longer a runtime dependency (test/bench only).
Changed — scalar SQL semantics are now Polars-SQL semantics
(via pl.sql_expr):
- ROUND rounds half-to-even (2.5 → 2), was half-away.
- CAST(float AS INT) truncates (2.7 → 2), was rounding.
- Math-domain edges: SQRT(-x) / LN(0) yield NaN / -inf instead of raising.
Preserved: XMATCH (SciPy cKDTree), margin handling, HATS-output
validity, aggregate decomposition, top-K / cone / MOC pushdown.
validate() returns the Plan; explain() returns a Plan summary.
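The scalar-semantics changes listed above can be made concrete with standard-library stand-ins. This sketch does not call Polars or acid; it only reproduces the described rules (half-to-even versus half-away rounding, truncating casts, and NaN / -inf at math-domain edges) with Python's decimal and math modules, and the helper names are hypothetical.

```python
import math
from decimal import ROUND_HALF_EVEN, ROUND_HALF_UP, Decimal


def round_half(value: str, mode: str) -> int:
    """Round a decimal string to an integer under the given tie-breaking mode."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=mode))


def sqrt_or_nan(x: float) -> float:
    """SQRT that yields NaN for negative input instead of raising."""
    return math.sqrt(x) if x >= 0 else math.nan


def ln_or_neg_inf(x: float) -> float:
    """LN that yields -inf at 0 (and NaN below 0) instead of raising."""
    if x > 0:
        return math.log(x)
    return -math.inf if x == 0 else math.nan


# ROUND: half-to-even (new) next to half-away-from-zero (old)
print(round_half("2.5", ROUND_HALF_EVEN), round_half("2.5", ROUND_HALF_UP))
# CAST(float AS INT): truncation (new) next to rounding (old)
print(math.trunc(2.7), round(2.7))
# Math-domain edges
print(sqrt_or_nan(-1.0), ln_or_neg_inf(0.0))
```

Queries that relied on 2.5 rounding to 3, or on a domain error aborting the query, are the ones most likely to change results.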
Query lowering redesign (breaking)¶
A rewrite of the lowering layer (docs/archive/QUERY-LOWERING-REDESIGN.md)
collapses the two-layer IR and the column-name mangling into a single
flat-named Plan both frontends emit directly. No perf change at
scale — the win is a smaller correctness surface.
Output column names change (user-visible). Join collisions keep the
anchor's natural name and get a _<alias> suffix on the right
(id / id_b); db.sql output collisions are SQL-dotted
(a.id / b.id). Names are now final from the start — identical on
disk, at the Result boundary, and in .columns — so saved queries
referencing the old mangled/demangled names will differ.
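The collision rule for join output names (anchor keeps the natural name, the right side gets a _<alias> suffix) can be sketched as follows. The join_output_names helper is hypothetical and covers only the fluent-join naming; it says nothing about which columns a real join emits.

```python
from typing import List


def join_output_names(anchor_cols: List[str], right_cols: List[str], right_alias: str) -> List[str]:
    """Hypothetical helper: anchor names stay natural, colliding right names get _<alias>."""
    taken = set(anchor_cols)
    out = list(anchor_cols)
    for col in right_cols:
        # Only a collision triggers the suffix; unique right-side names pass through.
        out.append(f"{col}_{right_alias}" if col in taken else col)
    return out


print(join_output_names(["id", "ra", "dec"], ["id", "mag"], "b"))
```

Since the names are final from the start, the same list would appear on disk, at the Result boundary, and in .columns.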
Removed (no aliases, alpha):
- XMATCH_DISTANCE(<alias>): replaced by the opt-in crossmatch(dist_col="sep") / XMATCH(..., dist_col => 'sep') column (un-named ⇒ not emitted).
- IN_MOC(...) outside a top-level conjunctive WHERE: it's a footprint restriction only (AND-ed WHERE, optional NOT, or .in_region(...)); anywhere else raises ValidationError.
- The <alias>__<col> mangle, executor.canonicalize_sql, and output.demangle_columns.
- The fluent → SQL-string → parser.parse round-trip; the fluent Catalog now compiles directly to a Plan (acid._fluent).
- The ir.py scratch IR (RelationRef / XMatchJoin / AggregatePlan / PartialAgg).
- scalar.RangeMembership / scalar.BoolOp: Predicate = ScalarExpr.
- acid.rewriter and acid.engine (both deleted); the PartitionFilters contract now lives in acid.ir.
Fluent join surface. Catalog.join(other, on=...) takes a single
on= — on="id" or on=(left, right), integer-ID equi-joins only.
Added¶
- acid.connect(source, *, workers="auto", threads=None, inmem_row_limit=50_000_000, cache_dir=None, progress="auto"): the single top-level entry point, replacing acid.sql / acid.run.
- Connection (renamed from Session): cgroup-aware workers="auto", lazy pool, sticky settings, __del__ / atexit cleanup, pickling rejected.
- Connection.open(name_or_path, *, alias=None, columns=None) -> Catalog: fluent entry; eager metadata read, lazy pool start.
- Connection.in_cone(center, *, radius): context manager scoping a cone over every query in the block (replaces the old cone= kwarg); nested blocks rejected.
- Connection.list_catalogs(): walk the registered roots for HATS-catalog basenames.
- Connection.add_catalog(name, **spec_kwargs) -> Catalog: lower-level alternative to open.
- Connection.register_moc(name, source): unchanged from Session.register_moc.
- Connection.map_partitions_sql(query, *, output=None, progress=None): renamed from Session.run.
- Catalog: frozen lazy-query handle. Composition verbs (where / select / limit / in_region / crossmatch / join / group_by / aggregate / having / sort) return new Catalogs; materialization verbs (head / execute / to_pandas / to_polars / to_astropy / to_arrow / save) run through the Connection.
- Catalog.crossmatch(other, *, radius, how="nearest"): spatial XMATCH; radius is an astropy Quantity (bare floats rejected).
- Catalog.join(other, *, on, how="inner"): integer-ID equi-join; bare-name + tuple forms; double-join / cross-Connection rejected.
- Catalog.group_by(*keys) / aggregate(**named) / having(pred): fluent aggregation via acid.agg constructors; keys appear output-first. Aggregation is the outer query; only having / sort / limit may follow.
- Catalog.sort(*keys, descending=False, nulls_last=False): ORDER BY; with .limit(n) it's top-K (pushed to phase-1). A standalone unbounded sort is rejected.
- acid.agg (and acid.AggExpr): the aggregate-constructor namespace for Catalog.aggregate.
- Catalog.in_region(region): a registered name, peer Catalog, FITS path/URL, mocpy.MOC, MocSpec, or HATS directory. Named/handle forms compile to IN_MOC verbatim; anonymous sources are content-hashed.
- Inline subquery / CTE pre-filter (db.sql): anchor FROM, XMATCH-RHS, and ordinary-JOIN-RHS accept (SELECT * FROM <catalog> [WHERE <pred>]) AS alias, folded into the per-partition filter plumbing. Projection narrowing / joins / aggregates / DISTINCT / GROUP BY / ORDER BY / LIMIT inside are rejected; see docs/archive/SUBQUERY-RHS.md.
- Catalog.save(path, *, name=None, overwrite=False, progress=None) -> Catalog: atomic-on-success HATS write (stages to a sibling <name>.acid-save-tmp, renames only on success). Returns a registered handle.
- Result.show(n=20, *, width=10_000): terminal pretty-print (same look as acid query); Result.__str__ returns it so print(r) works.
- Result.write_csv(path) / write_fits(path) / write(path, format=None): single-file writers alongside write_parquet; write infers format from the extension.
- output.format_table_text / output.print_table: exported Arrow-table formatting helpers.
- progress kwarg on every materialization method plus a Connection-level default; "auto" enables tqdm under a TTY / Jupyter and stays silent under pytest.
- ConnectionClosedError, StaleCatalogError: typed errors; StaleCatalogError carries captured_cones / current_cones.
- ConeSpec, ConnectionStatus: public dataclasses for cone geometry and Connection.status().
- Astropy adapter (acid._coerce): converts Quantity / SkyCoord / (ra, dec) to engine types. Astropy is now a hard runtime dependency (imported lazily).
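The stage-then-rename behavior described for Catalog.save can be sketched with the standard library. This is a generic illustration, not acid's implementation: save_atomically and the write callback are hypothetical, and in the overwrite case the old target is removed just before the rename, so that step is not strictly atomic here.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def save_atomically(target: Path, write: Callable[[Path], object]) -> Path:
    """Write into a sibling staging dir and rename it into place only on success."""
    staging = target.with_name(target.name + ".acid-save-tmp")
    if staging.exists():
        shutil.rmtree(staging)  # clear leftovers from an earlier failed run
    staging.mkdir(parents=True)
    try:
        write(staging)          # may raise; the target is untouched so far
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)
        raise
    if target.exists():
        shutil.rmtree(target)   # old data goes away only after a successful write
    os.rename(staging, target)  # sibling path, so this stays on one filesystem
    return target


root = Path(tempfile.mkdtemp())
out = save_atomically(root / "mycat", lambda d: (d / "part.parquet").write_text("rows"))
print(sorted(p.name for p in out.iterdir()))
```

Keeping the staging directory next to the target is what makes the final rename cheap, since both paths share a filesystem.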
Changed¶
- Default workers is "auto" (was 1); it resolves to min(sched_getaffinity, cgroup CPU quota) on Linux, else os.cpu_count(). Tests opt back to workers=1.
- CLI _print_table is now a thin wrapper around output.print_table.
- Catalog.save staging dir is now visible to ls (<name>.acid-save-tmp, no leading dot).
- Runtime dependencies. astropy>=5 and fsspec[http]>=2023.1 are now hard deps; duckdb is test/bench-only.
Removed¶
- acid.sql(query, catalogs=C, ...). Use with acid.connect(C) as db: db.sql(query) instead.
- acid.run(query, catalogs=C, output=O, ...). Use with acid.connect(C) as db: db.map_partitions_sql(query, output=O) instead.
- The Session class / SessionClosedError, renamed to Connection / ConnectionClosedError; Session.run → map_partitions_sql.
- Session.materialize(name, query, ...). Use Catalog.save(path, name=name), or db.sql(query, output=path) + db.add_catalog(name, path=...).
- The cone=(ra, dec, r_deg) kwarg on sql / run. Use with db.in_cone((ra, dec), radius=r_deg*u.deg): instead.
- acid.Registry is no longer in __all__ (still importable).
Fixed¶
A senior-engineering pass after M1–M3 surfaced several bugs
(CATALOG-API.md §14.4):
- Nested in_cone blocks silently dropped the inner cone; now rejected.
- Auto-MOC names defaulted to basenames, silently aliasing; now content-hashed.
- Connection.open(absolute_path) rebuilt the TableSpec each call, breaking Catalog.__eq__; now reuses the cache hit.
- Connection.add_catalog returned a TableSpec; now returns a Catalog.
- Catalog names with special chars produced invalid SQL; identifiers now quoted (mostly moot post-redesign).
- Catalog.save(overwrite=True) deleted the target before running, so a failure lost data; now stages and atomic-renames.
- Catalog.in_region(<MOC>) emitted IN_MOC inside a filtered subquery (rejected by the engine); moved to outer WHERE.
Performance¶
- Persistent worker pool with Manager-dict plan delivery. The pool stays alive across queries, and the Plan is delivered once per query via a multiprocessing.Manager dict-proxy keyed by version instead of as a per-task argument. This lifts Catalog.save throughput to parity with acid query (was capped at ~7-8 tasks/s) while avoiding fork() of heavy-RAM notebook parents.
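The version-keyed delivery idea can be sketched without any multiprocessing machinery. In the sketch below a plain dict stands in for the multiprocessing.Manager dict-proxy, and WorkerPlanCache is a hypothetical worker-side cache: tasks carry only a small version key, and the plan is fetched from the shared mapping once per version.

```python
from typing import Any, Dict, MutableMapping, Optional


class WorkerPlanCache:
    """Worker-side cache: fetch the plan from the shared mapping once per version."""

    def __init__(self, shared: MutableMapping[int, Any]) -> None:
        self._shared = shared  # stands in for a Manager dict proxy
        self._version: Optional[int] = None
        self._plan: Any = None
        self.fetches = 0       # counts lookups against the shared mapping

    def plan_for(self, version: int) -> Any:
        if version != self._version:
            self._plan = self._shared[version]  # one lookup per query, not per task
            self._version = version
            self.fetches += 1
        return self._plan


shared: Dict[int, Any] = {1: {"query": "q1"}}
cache = WorkerPlanCache(shared)
for _task in range(1000):  # every task of query 1 reuses the cached plan
    cache.plan_for(1)
shared[2] = {"query": "q2"}
cache.plan_for(2)
print(cache.fetches)
```

With a real Manager proxy each lookup is an inter-process round trip, which is why caching by version matters more there than in this in-process stand-in.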
Notes for downstream users¶
- CLI subcommands are unchanged in name/shape, with flag changes: acid query gains --format (hats/parquet/csv/fits) and drops --reduced-out; acid download gains --insecure; --engine / --duckdb-threads are gone (--duckdb-threads → --threads).
- The on-disk HATS catalog format is unchanged.
- Flat-on-disk column names resolve #41; round-tripping a db.sql dotted name (a.foo) through the loader is tracked as #45.
[0.1.0a3] — earlier release¶
See git history (git log v0.1.0a2..v0.1.0a3). This CHANGELOG
starts at 0.2.0; prior releases are reconstructable from commit
messages.