Filtering rows¶
Most queries start with a row filter: keep sources brighter than some magnitude, drop bad-quality flags, take only objects with finite photometry. This page covers ACID's plain row-filtering surface:
Catalog.where("<predicate>")(fluent) / SQLWHEREclauses;- how predicates compose (AND / OR / parentheses);
- NULL handling (
IS NULL,IS NOT NULL); - which filters get pushed all the way down to parquet — the
_healpix_29row-group story.
Spatial filters — restricting to a sky cone or a survey-footprint MOC —
have their own page, Sky regions & footprints. The
two layers compose: a where("g_mag < 18") and an in_region("des_dr2")
on the same Catalog both apply.
The shortest filter¶
The fluent surface and SQL accept the same predicate strings — both are parsed by SQLGlot and lowered through Polars-SQL. Pick the surface that matches the rest of your code.
Catalog.where(...) is sticky on the catalog — calling it twice composes
the predicates with AND. The two snippets below produce the same plan:
gaia.where("phot_g_mean_mag < 16").where("phot_g_mean_mag > 12")
gaia.where("phot_g_mean_mag < 16 AND phot_g_mean_mag > 12")
If you want OR, write it inline:
What the predicate looks like¶
The fluent surface works with flat column names — the ones
cat.columns reports — not alias.col:
gaia.where("phot_g_mean_mag < 16") # OK
gaia.where("gaia.phot_g_mean_mag < 16") # not how the fluent surface works
The SQL escape hatch (acid.sql.query(...)) still parses real SQL and accepts
alias.col:
The expression grammar is the SQL one — comparisons, AND / OR /
NOT, parentheses, arithmetic, IS NULL / IS NOT NULL, BETWEEN,
IN (...), simple scalar functions. (Inside inline subquery / CTE
pre-filters the allowed set is a narrower allowlist; see the
SQL escape hatch page for that case.)
Composing predicates¶
Color cuts and AND-chains¶
clean = (gaia
.where("phot_g_mean_mag IS NOT NULL")
.where("phot_g_mean_mag BETWEEN 14 AND 18")
.where("bp_rp BETWEEN 0.5 AND 1.5"))
The same as one big predicate; pick whichever reads better:
clean = gaia.where(
"phot_g_mean_mag IS NOT NULL "
"AND phot_g_mean_mag BETWEEN 14 AND 18 "
"AND bp_rp BETWEEN 0.5 AND 1.5"
)
Disjunctions and parentheses¶
When mixing AND and OR, parenthesize. SQL's precedence puts AND ahead of OR, which is rarely what you want:
IN (...) and lists of values¶
gaia.where("band IN ('g', 'r', 'i')")
gaia.where("source_id IN (4295806720, 4295806721, 4295806722)")
The right-hand side must be a list of literals on the fluent surface. For "is this column in another column", drop into SQL.
NULL handling¶
NULLs propagate through SQL the usual way: NULL < 18 is NULL, not
TRUE or FALSE. Row filters require an explicit IS NULL /
IS NOT NULL:
# Sources where photometry exists at all:
have_g = gaia.where("phot_g_mean_mag IS NOT NULL")
# The classic anti-join shape — see crossmatch.md:
unmatched = (a.crossmatch(b, radius=1 * u.arcsec, how="left", dist_col="d")
.where("d IS NULL"))
d IS NULL after a how="left" crossmatch is how you ask for every
anchor without a counterpart. Plain d = NULL is always NULL and
filters nothing.
Filtering before vs. after a join¶
Filters applied before a join run as a pre-filter on the catalog they were attached to:
# `bright_a` is filtered before the matcher sees it.
bright_a = a.where("phot_g_mean_mag < 16")
matches = bright_a.crossmatch(b, radius=1 * u.arcsec)
This is a real perf win on tight cuts — the matcher only ever sees the bright anchors.
To pre-filter the right operand instead, filter it inside the
.crossmatch(...) call (the operand is a real subtree). With
maxmatch=1 (nearest) this gives "nearest surviving match"
semantics: the matcher only sees the bright right-side sources, so the
nearest match is the nearest bright one — not the nearest source,
then dropped:
# pre-filter the right side: nearest *bright* counterpart
matches = a.crossmatch(b.where("mag < 20"), radius=1 * u.arcsec)
The same shape on the SQL side is an inline subquery in the FROM (or JOIN-RHS) position:
SELECT a.id, b.id AS b_id
FROM (SELECT * FROM a WHERE phot_g_mean_mag < 16) AS a
JOIN b ON XMATCH(radius_arcsec => 1.0)
Filters applied after a join are post-filters on the matched frame
and reference the merged-column names (d, mag_b):
near_bright = (a
.crossmatch(b, radius=5 * u.arcsec, dist_col="d")
.where("d < 1.0") # tighter than the search radius
.where("mag_b < 20")) # b's columns named after suffixing
For the inline-subquery pre-filter shape (CTEs / FROM (SELECT ...)),
see the SQL escape hatch; the
inner-WHERE grammar is restricted there.
Pushdown — what acid actually reads from disk¶
ACID is built on Polars and reads parquet through scan_parquet.
That gives you two pushdowns for free:
- Column pruning. Only the columns you reference (in
select,where, aggregates, sort keys) are read from disk. Awhereon a single column does not pull the other 149 columns of a wide catalog. - Row-group statistics. Range predicates (
<,<=,>=,>,BETWEEN) on a column with per-row-group min/max statistics let Polars skip whole row groups whose range does not overlap the predicate. This is dramatic on sorted or near-sorted columns.
The most important case is _healpix_29, the integer column HATS
catalogs carry that encodes each row's HEALPix nested index at order 29.
Spatial filters (cone restriction, MOC filtering — see
Sky regions & footprints) get rewritten internally
into integer range predicates on _healpix_29, which prunes whole row
groups before Polars decompresses a single page. It is the trick that
makes per-tuple cone filtering fast on billion-row catalogs.
You do not write _healpix_29 predicates by hand. ACID emits them for
you whenever you use a spatial filter — the column is mentioned here
only so you know why those spatial filters fly.
Push filters down by writing them first
Polars' query optimizer reorders where / select, so the order
you write them in usually doesn't matter for correctness. But
.where(...) before .crossmatch(...) on the catalog being
matched is meaningfully different from a post-join filter — see
the section above.
What doesn't work¶
LIKE/ regex predicates inside an inline subquery pre-filter (theFROM (SELECT ... WHERE ...)form). Top-levelWHERE/ fluent.whereaccept the full Polars-SQL expression grammar.- Subqueries inside
WHERE. The supported subquery shape is(SELECT * FROM <registered_catalog> WHERE <row-level predicate>)in FROM / JOIN position — see the SQL escape hatch. WHEREpredicates on aggregate outputs — useHAVING(or a fluent post-aggregate.where(...)); see aggregation.
See also¶
- Crossmatching catalogs — combining row filters with a crossmatch, the pre-filter / post-filter split.
- Aggregating — counting filtered rows;
HAVINGfor post-aggregate filters. - Sky regions & footprints — for spatial cuts
(cones, MOCs,
in_region,IN_MOC). - SQL escape hatch —
acid.sql.query(...)for query shapes the fluent verbs don't cover.