Skip to content

Performance & parallelism

This page is the practical guide to running ACID at scale: how the worker and thread budget divides up, when results spill from memory to disk, how to tune the allocator on big machines, and what to change between a laptop debug session and a cluster production run.

The default settings work well on most machines without intervention. Reach for this page when you're scaling up — on a many-core node, in a cgroup-restricted container, or when a query hits memory limits.

Worker & thread budget (and cgroups)

ACID executes partition-by-partition through a process pool. Two knobs control how much hardware that pool consumes:

  • workers — the number of worker processes.
  • threads — the per-worker Polars thread budget (POLARS_MAX_THREADS).

Their product, workers × threads, is the total Rayon-thread count the pool can run concurrently.

cpu_cap() — the cgroup-aware ceiling

ACID never uses os.cpu_count() directly. The single source of truth for "how many cores can this process actually use" is acid._sysinfo.cpu_cap(), which returns min(sched_getaffinity, cgroup CPU quota) (falling back to os.cpu_count() when neither is set, never below 1).

The matters in a container: on a 128-core host with a 4-core cgroup quota, os.cpu_count() reports 128 but cpu_cap() reports 4. Sizing workers or threads against os.cpu_count() would oversubscribe the quota by 32×.

Defaults

Setting Default Resolved from
workers "auto" ACID_WORKERS, config workers, --workers, then "auto"
"auto" value min(cpu_cap, mem_cap, 24) CPU cap from above; memory cap = available_ram_bytes // mem_per_worker_gb; auto is capped at 24
threads cpu_cap // workers (single worker → full cpu_cap) --threads, threads=, then the divided default
mem_per_worker_gb 4 GB ACID_MEM_PER_WORKER_GB, config, then built-in

An explicit workers (Python kwarg, CLI flag, env var, or config) is never capped at 24 — only the "auto" resolution applies that ceiling. The 24-worker cap exists because efficiency flattens before the core count on big nodes; passing workers=64 will use 64 anyway.

How to set them

import acid

# Auto-size on the current machine (the default).
acid.init("/data/hats")

# Explicit: 16 workers, 2 threads each = 32 Rayon threads.
acid.init("/data/hats", workers=16, threads=2)

# Memory-tight container: give each worker more RAM headroom.
acid.init("/data/hats", mem_per_worker_gb=16)

(Each acid.init(...) above is shown standalone for clarity; in a single process you call it once. Re-initializing with a different config raises ConfigError — call acid.shutdown() first to rebuild.)

acid query "..." --db /data/hats --workers 16 --threads 2

# Or via env vars (picked up by both CLI and Python entry points).
export ACID_WORKERS=16
export ACID_MEM_PER_WORKER_GB=16
acid query "..." --db /data/hats

Symptoms and first knobs

Symptom First knob
Phase 1 slow on a big machine Raise workers explicitly ("auto" caps at 24, so on a 64-core node it won't go higher unless you ask); confirm via ACID_PROFILE=1 that anchor_setup / xmatch dominate
Job OOMs in phase 1 Lower ram_budget (work tuples shrink to fit ram_budget / workers); or lower workers; or raise mem_per_worker_gb so "auto" picks fewer — see RAM budget
Job OOMs in phase 2 (reduce) Lower inmem_row_limit so phase 1 spills earlier and the reduce stays disk-backed (see Memory & spill)
Process pool churn between queries The module-level default connection reuses one persistent pool across the process; don't acid.shutdown() between queries (see Laptop → cluster)
Threads contend; CPU under-utilized Check whether workers × threads matches cpu_cap(); cgroup quota may be lower than expected

RAM budget and out-of-memory jobs

ACID sizes its unit of work — the work tuple — to fit a RAM budget, so a query streams through memory in bounded chunks instead of loading a whole HEALPix partition at once. This is the primary protection against phase-1 OOM, and it is on by default.

How it works: ACID estimates each candidate sky cell's resident bytes (from per-catalog row-count maps and the columns your query reads) and picks the coarsest cells whose estimated bytes fit ram_budget / workers — coalescing many small partitions into one tuple, and splitting an oversized partition into several. You don't manage partitions; you set one number, ram_budget, and the planner sizes the work to it.

ram_budget is the total RAM the planner budgets for, across all workers. Default: 0.25 × available RAM (cgroup-aware — the cgroup limit, not the host total). Set it as bytes or a human size:

import acid

acid.init("/data/hats", workers=32, ram_budget="64GB")
acid query "..." --db /data/hats --workers 32 --ram-budget 64GB
export ACID_RAM_BUDGET=64GB          # or 512MiB, 32g, or a byte count
# or in the YAML / acid config: ram_budget: 64GB

When a phase-1 OOM happens, lower ram_budget first — it shrinks each tuple's working set directly, which is the most targeted lever. Lowering workers also helps (fewer concurrent working sets), and raising mem_per_worker_gb makes workers="auto" pick fewer. The RAM ceiling will split a partition below its on-disk granularity when it must to stay within budget, so it protects you even on a catalog with a few enormous partitions.

Phase-1 OOM vs. reduce OOM

ram_budget bounds the phase-1 per-tuple working set. A reduce-step OOM — usually a r.to_polars() / r.to_pandas() that loads a large result whole — is a different problem; lower inmem_row_limit and stream with r.batches() instead (below).

Memory & spill

ACID does not hold the whole result in memory by default. Two mechanisms keep the parent process bounded:

inmem_row_limit — phase-1 spill threshold

When a query's phase-1 partial output grows past inmem_row_limit rows (default 50_000_000), ACID spills the partials to a per-Connection scratch directory instead of accumulating them in RAM. The Result you get back is disk-backedr.to_arrow() and r.batches() load lazily from the spill.

The threshold is an acid.init(...) keyword:

acid.init("/data/hats", inmem_row_limit=200_000_000)

Or via env var: ACID_INMEM_ROW_LIMIT=200_000_000.

When to raise it: the result genuinely fits in RAM and you want to skip the spill (small global aggregates on a memory-rich box).

When to lower it: the reduce step is OOM-ing on big results. A smaller inmem_row_limit makes phase 1 spill earlier and forces the phase-2 reduce to run from disk (in batches) instead of materializing the merged frame in memory.

See Working with results & exporting — Streaming for the read side: r.batches() is the streaming consumer, and is the right tool when even the spilled result is too large to load whole.

--mem-limit on acid hats build-margin

The margin-cache builder has its own, separate spill knob. It's covered on Margin caches — Spilling mechanics. Do not confuse the two: inmem_row_limit controls query-time spill; --mem-limit controls margin-cache-build-time spill.

Allocator tuning

On large machines, the memory allocator that Polars statically bundles (jemalloc) can dominate wall time at high worker counts — each worker's madvise(MADV_DONTNEED) calls serialize on the kernel's mmap_lock. ACID's default flips jemalloc into a "never purge" mode that removes the contention at the cost of ~20 % higher peak RSS.

The knob is the env var _RJEM_MALLOC_CONF, set at import time:

_RJEM_MALLOC_CONF=dirty_decay_ms:-1,muzzy_decay_ms:-1

You only need to override this on memory-constrained nodes. The full reference, including the lifecycle ("must be set before import polars") and the measured speed-vs-RAM tradeoffs, is in MEMORY-TUNING.md.

Worker-startup knobs

Three default-on env vars control worker-pool startup. You opt out by setting them to 0:

Variable Default What it does Opt out when…
ACID_CAP_BLAS on Caps OPENBLAS_NUM_THREADS / OMP_NUM_THREADS / MKL_NUM_THREADS / NUMEXPR_NUM_THREADS to 1 before numpy imports. ACID's compute is Polars-governed; BLAS pools only burn CPU at import. You do heavy numpy/BLAS linalg in the same process.
ACID_FORKSERVER_PRELOAD on Pre-imports numpy/pyarrow/polars/scipy/cdshealpix in the forkserver, so workers inherit copy-on-write instead of re-importing serially. Many short one-off queries where the ~2.5 s bootstrap dominates.
ACID_PREWARM on Brings all workers online together behind a barrier before the first query. You want lazy worker spawn (the first query may only touch a few partitions).

These mostly matter at high worker counts; at workers=1 they're harmless but buy little. Full details in MEMORY-TUNING.md.

Profiling — ACID_PROFILE

To see where a query spends its time, set ACID_PROFILE=1 before launching:

ACID_PROFILE=1 acid query "..." --db /data/hats --workers 16

A per-step summary prints to stderr at the end of the run, showing which phases (anchor_setup, right_setup, xmatch, execute_final, write) dominate. The same flag works for the Python API — set the env var before importing acid. Profiling never affects results, only reporting.

For the full per-worker matrix (a JSON file with timings for every work-tuple), also set:

ACID_PROFILE_OUT=/tmp/profile.json

Open the resulting JSON in Polars / pandas to identify outlier partitions or imbalanced work.

Laptop → cluster scaling

The same code runs on a laptop and a cluster; what changes is the amount of hardware and a few env settings.

Laptop

# Default — auto-sizes against the laptop's affinity / RAM.
import acid
import astropy.units as u

df = acid.open("a").crossmatch(acid.open("b"), radius=1 * u.arcsec).to_polars()

The laptop typically has 4–16 cores and 16–64 GB of RAM. workers="auto" picks a small worker count, the spill threshold (50 M rows) almost never trips, and _RJEM_MALLOC_CONF's ~20 % RSS overhead is invisible. Nothing to tune.

Cluster / SLURM job

Three things change:

  1. Reuse the connection. A worker pool spin-up is ~2.5 s of imports plus per-worker init. Spinning a pool per query is the easiest way to make a "fast" engine look slow. The module-level default connection is one shared pool across the whole process, so a batch loop reuses it automatically — just don't acid.shutdown() between iterations:

    import acid
    
    acid.init("/data/hats", workers=32)
    for region in regions:
        r = acid.sql.query(f"SELECT ... WHERE {region}")
        r.export(f"out/{region}.parquet")
    
  2. Respect the cgroup. Inside a SLURM --cpus-per-task=8 allocation, os.cpu_count() still reports the full host, but ACID uses cpu_cap(). With workers="auto", you'll get a sensible count without flag-fiddling — it sees the 8-core cgroup quota, not the host's 128 cores. The same goes for memory: mem_per_worker_gb bounds "auto" against min(physical RAM, cgroup memory limit), not the host RAM.

  3. Pin workers / threads if you need determinism. For reproducibility across runs (benchmarking, scheduling against downstream consumers), set both explicitly:

    acid query "..." --db /data/hats --workers 16 --threads 2
    

    or, equivalently, ACID_WORKERS=16 / --threads 2.

When to override _RJEM_MALLOC_CONF

On a fat-RAM compute node, leave the default. On a shared login node with an RSS cap, or a tight cgroup memory limit, set:

export _RJEM_MALLOC_CONF=dirty_decay_ms:10000,muzzy_decay_ms:10000

This recovers most of the RAM (jemalloc purges freed pages after 10 seconds of idleness) at the cost of ~⅔ of the at-scale speed win. See MEMORY-TUNING.md for the full table of choices.

What's not done

A few items are deliberately not implemented; if you're paged into one, the symptom is real but the fix lives elsewhere:

  • Per-worker parquet LRU cache. With adaptive Norder, a coarser right partition is re-read once per refinement leaf. At extreme parallelism this shows up as redundant I/O in the profile. Workaround: rebuild the right catalog's margin at a wider radius (one re-read per leaf, but a bounded set), or use fewer workers.
  • Locality-preserving anchor scheduling. Tuples that share a coarser partition aren't co-scheduled, so the filesystem cache helps less than it could.
  • Streaming progress callbacks. Progress is rendered to stderr only (--progress on/off/plain); there's no Python callback hook.

If a workload is consistently bound by one of these, surface it — the roadmap is in FLUENT-FUTURE-EXTENSIONS.md and CLAUDE.md's "Things explicitly NOT done" section.

See also

  • MEMORY-TUNING.md — the allocator and worker-startup deep dive (root, single source of truth).
  • Margin caches — the build-time memory knob (--mem-limit), which is different from inmem_row_limit.
  • Working with results & exportingr.batches() for streaming large results, the spill story from the consumer side.
  • Errors — ExecutionError — what an OOM looks like and the first knob to reach for.
  • CLI reference — every acid query / acid hats build-margin flag with defaults.