Performance & parallelism¶

This page is the practical guide to running ACID at scale: how the worker and thread budget divides up, when results spill from memory to disk, how to tune the allocator on big machines, and what to change between a laptop debug session and a cluster production run.

The default settings work well on most machines without intervention. Reach for this page when you're scaling up — on a many-core node, in a cgroup-restricted container, or when a query hits memory limits.

Worker & thread budget (and cgroups)¶

ACID executes partition-by-partition through a process pool. Two knobs control how much hardware that pool consumes:

workers — the number of worker processes.
threads — the per-worker Polars thread budget (POLARS_MAX_THREADS).

Their product, workers × threads, is the total Rayon-thread count the pool can run concurrently.

`cpu_cap()` — the cgroup-aware ceiling¶

ACID never uses os.cpu_count() directly. The single source of truth for "how many cores can this process actually use" is acid._sysinfo.cpu_cap(), which returns min(sched_getaffinity, cgroup CPU quota) (falling back to os.cpu_count() when neither is set, never below 1).

The matters in a container: on a 128-core host with a 4-core cgroup quota, os.cpu_count() reports 128 but cpu_cap() reports 4. Sizing workers or threads against os.cpu_count() would oversubscribe the quota by 32×.

Defaults¶

Setting	Default	Resolved from
`workers`	`"auto"`	`ACID_WORKERS`, config `workers`, `--workers`, then `"auto"`
`"auto"` value	`min(cpu_cap, mem_cap, 24)`	CPU cap from above; memory cap = `available_ram_bytes // mem_per_worker_gb`; auto is capped at 24
`threads`	`cpu_cap // workers` (single worker → full `cpu_cap`)	`--threads`, `threads=`, then the divided default
`mem_per_worker_gb`	`4` GB	`ACID_MEM_PER_WORKER_GB`, config, then built-in

An explicit workers (Python kwarg, CLI flag, env var, or config) is never capped at 24 — only the "auto" resolution applies that ceiling. The 24-worker cap exists because efficiency flattens before the core count on big nodes; passing workers=64 will use 64 anyway.

How to set them¶

PythonCLI

import acid

# Auto-size on the current machine (the default).
acid.init("/data/hats")

# Explicit: 16 workers, 2 threads each = 32 Rayon threads.
acid.init("/data/hats", workers=16, threads=2)

# Memory-tight container: give each worker more RAM headroom.
acid.init("/data/hats", mem_per_worker_gb=16)

(Each acid.init(...) above is shown standalone for clarity; in a single process you call it once. Re-initializing with a different config raises ConfigError — call acid.shutdown() first to rebuild.)

acid query "..." --db /data/hats --workers 16 --threads 2

# Or via env vars (picked up by both CLI and Python entry points).
export ACID_WORKERS=16
export ACID_MEM_PER_WORKER_GB=16
acid query "..." --db /data/hats

Symptoms and first knobs¶

Symptom	First knob
Phase 1 slow on a big machine	Raise `workers` explicitly (`"auto"` caps at 24, so on a 64-core node it won't go higher unless you ask); confirm via `ACID_PROFILE=1` that `anchor_setup` / `xmatch` dominate
Job OOMs in phase 1	Lower `ram_budget` (work tuples shrink to fit `ram_budget / workers`); or lower `workers`; or raise `mem_per_worker_gb` so `"auto"` picks fewer — see RAM budget
Job OOMs in phase 2 (reduce)	Lower `inmem_row_limit` so phase 1 spills earlier and the reduce stays disk-backed (see Memory & spill)
Process pool churn between queries	The module-level default connection reuses one persistent pool across the process; don't `acid.shutdown()` between queries (see Laptop → cluster)
Threads contend; CPU under-utilized	Check whether `workers × threads` matches `cpu_cap()`; cgroup quota may be lower than expected

RAM budget and out-of-memory jobs¶

ACID sizes its unit of work — the work tuple — to fit a RAM budget, so a query streams through memory in bounded chunks instead of loading a whole HEALPix partition at once. This is the primary protection against phase-1 OOM, and it is on by default.

How it works: ACID estimates each candidate sky cell's resident bytes (from per-catalog row-count maps and the columns your query reads) and picks the coarsest cells whose estimated bytes fit ram_budget / workers — coalescing many small partitions into one tuple, and splitting an oversized partition into several. You don't manage partitions; you set one number, ram_budget, and the planner sizes the work to it.

ram_budget is the total RAM the planner budgets for, across all workers. Default: 0.25 × available RAM (cgroup-aware — the cgroup limit, not the host total). Set it as bytes or a human size:

PythonCLIEnv / config

import acid

acid.init("/data/hats", workers=32, ram_budget="64GB")

acid query "..." --db /data/hats --workers 32 --ram-budget 64GB

export ACID_RAM_BUDGET=64GB          # or 512MiB, 32g, or a byte count
# or in the YAML / acid config: ram_budget: 64GB

When a phase-1 OOM happens, lower ram_budget first — it shrinks each tuple's working set directly, which is the most targeted lever. Lowering workers also helps (fewer concurrent working sets), and raising mem_per_worker_gb makes workers="auto" pick fewer. The RAM ceiling will split a partition below its on-disk granularity when it must to stay within budget, so it protects you even on a catalog with a few enormous partitions.

Phase-1 OOM vs. reduce OOM

ram_budget bounds the phase-1 per-tuple working set. A reduce-step OOM — usually a r.to_polars() / r.to_pandas() that loads a large result whole — is a different problem; lower inmem_row_limit and stream with r.batches() instead (below).

Memory & spill¶

ACID does not hold the whole result in memory by default. Two mechanisms keep the parent process bounded:

`inmem_row_limit` — phase-1 spill threshold¶

When a query's phase-1 partial output grows past inmem_row_limit rows (default 50_000_000), ACID spills the partials to a per-Connection scratch directory instead of accumulating them in RAM. The Result you get back is disk-backed — r.to_arrow() and r.batches() load lazily from the spill.

The threshold is an acid.init(...) keyword:

acid.init("/data/hats", inmem_row_limit=200_000_000)

Or via env var: ACID_INMEM_ROW_LIMIT=200_000_000.

When to raise it: the result genuinely fits in RAM and you want to skip the spill (small global aggregates on a memory-rich box).

When to lower it: the reduce step is OOM-ing on big results. A smaller inmem_row_limit makes phase 1 spill earlier and forces the phase-2 reduce to run from disk (in batches) instead of materializing the merged frame in memory.

See Working with results & exporting — Streaming for the read side: r.batches() is the streaming consumer, and is the right tool when even the spilled result is too large to load whole.

`--mem-limit` on `acid hats build-margin`¶

The margin-cache builder has its own, separate spill knob. It's covered on Margin caches — Spilling mechanics. Do not confuse the two: inmem_row_limit controls query-time spill; --mem-limit controls margin-cache-build-time spill.

Allocator tuning¶

On large machines, the memory allocator that Polars statically bundles (jemalloc) can dominate wall time at high worker counts — each worker's madvise(MADV_DONTNEED) calls serialize on the kernel's mmap_lock. ACID's default flips jemalloc into a "never purge" mode that removes the contention at the cost of ~20 % higher peak RSS.

The knob is the env var _RJEM_MALLOC_CONF, set at import time:

_RJEM_MALLOC_CONF=dirty_decay_ms:-1,muzzy_decay_ms:-1

You only need to override this on memory-constrained nodes. The full reference, including the lifecycle ("must be set before import polars") and the measured speed-vs-RAM tradeoffs, is in MEMORY-TUNING.md.

Worker-startup knobs¶

Three default-on env vars control worker-pool startup. You opt out by setting them to 0:

Variable	Default	What it does	Opt out when…
`ACID_CAP_BLAS`	on	Caps `OPENBLAS_NUM_THREADS` / `OMP_NUM_THREADS` / `MKL_NUM_THREADS` / `NUMEXPR_NUM_THREADS` to 1 before numpy imports. ACID's compute is Polars-governed; BLAS pools only burn CPU at import.	You do heavy numpy/BLAS linalg in the same process.
`ACID_FORKSERVER_PRELOAD`	on	Pre-imports numpy/pyarrow/polars/scipy/cdshealpix in the `forkserver`, so workers inherit copy-on-write instead of re-importing serially.	Many short one-off queries where the ~2.5 s bootstrap dominates.
`ACID_PREWARM`	on	Brings all workers online together behind a barrier before the first query.	You want lazy worker spawn (the first query may only touch a few partitions).

These mostly matter at high worker counts; at workers=1 they're harmless but buy little. Full details in MEMORY-TUNING.md.

Profiling — `ACID_PROFILE`¶

To see where a query spends its time, set ACID_PROFILE=1 before launching:

ACID_PROFILE=1 acid query "..." --db /data/hats --workers 16

A per-step summary prints to stderr at the end of the run, showing which phases (anchor_setup, right_setup, xmatch, execute_final, write) dominate. The same flag works for the Python API — set the env var before importing acid. Profiling never affects results, only reporting.

For the full per-worker matrix (a JSON file with timings for every work-tuple), also set:

ACID_PROFILE_OUT=/tmp/profile.json

Open the resulting JSON in Polars / pandas to identify outlier partitions or imbalanced work.

Laptop → cluster scaling¶

The same code runs on a laptop and a cluster; what changes is the amount of hardware and a few env settings.

Laptop¶

# Default — auto-sizes against the laptop's affinity / RAM.
import acid
import astropy.units as u

df = acid.open("a").crossmatch(acid.open("b"), radius=1 * u.arcsec).to_polars()

The laptop typically has 4–16 cores and 16–64 GB of RAM. workers="auto" picks a small worker count, the spill threshold (50 M rows) almost never trips, and _RJEM_MALLOC_CONF's ~20 % RSS overhead is invisible. Nothing to tune.

Cluster / SLURM job¶

Three things change:

Reuse the connection. A worker pool spin-up is ~2.5 s of imports plus per-worker init. Spinning a pool per query is the easiest way to make a "fast" engine look slow. The module-level default connection is one shared pool across the whole process, so a batch loop reuses it automatically — just don't acid.shutdown() between iterations:
```
import acid

acid.init("/data/hats", workers=32)
for region in regions:
    r = acid.sql.query(f"SELECT ... WHERE {region}")
    r.export(f"out/{region}.parquet")
```
Respect the cgroup. Inside a SLURM --cpus-per-task=8 allocation, os.cpu_count() still reports the full host, but ACID uses cpu_cap(). With workers="auto", you'll get a sensible count without flag-fiddling — it sees the 8-core cgroup quota, not the host's 128 cores. The same goes for memory: mem_per_worker_gb bounds "auto" against min(physical RAM, cgroup memory limit), not the host RAM.
Pin workers / threads if you need determinism. For reproducibility across runs (benchmarking, scheduling against downstream consumers), set both explicitly:
```
acid query "..." --db /data/hats --workers 16 --threads 2
```
or, equivalently, ACID_WORKERS=16 / --threads 2.

When to override `_RJEM_MALLOC_CONF`¶

On a fat-RAM compute node, leave the default. On a shared login node with an RSS cap, or a tight cgroup memory limit, set:

export _RJEM_MALLOC_CONF=dirty_decay_ms:10000,muzzy_decay_ms:10000

This recovers most of the RAM (jemalloc purges freed pages after 10 seconds of idleness) at the cost of ~⅔ of the at-scale speed win. See MEMORY-TUNING.md for the full table of choices.

What's not done¶

A few items are deliberately not implemented; if you're paged into one, the symptom is real but the fix lives elsewhere:

Per-worker parquet LRU cache. With adaptive Norder, a coarser right partition is re-read once per refinement leaf. At extreme parallelism this shows up as redundant I/O in the profile. Workaround: rebuild the right catalog's margin at a wider radius (one re-read per leaf, but a bounded set), or use fewer workers.
Locality-preserving anchor scheduling. Tuples that share a coarser partition aren't co-scheduled, so the filesystem cache helps less than it could.
Streaming progress callbacks. Progress is rendered to stderr only (--progress on/off/plain); there's no Python callback hook.

If a workload is consistently bound by one of these, surface it — the roadmap is in FLUENT-FUTURE-EXTENSIONS.md and CLAUDE.md's "Things explicitly NOT done" section.