Skip to content

Filtering rows

Most queries start with a row filter: keep sources brighter than some magnitude, drop bad-quality flags, take only objects with finite photometry. This page covers ACID's plain row-filtering surface:

  • Catalog.where("<predicate>") (fluent) / SQL WHERE clauses;
  • how predicates compose (AND / OR / parentheses);
  • NULL handling (IS NULL, IS NOT NULL);
  • which filters get pushed all the way down to parquet — the _healpix_29 row-group story.

Spatial filters — restricting to a sky cone or a survey-footprint MOC — have their own page, Sky regions & footprints. The two layers compose: a where("g_mag < 18") and an in_region("des_dr2") on the same Catalog both apply.

The shortest filter

The fluent surface and SQL accept the same predicate strings — both are parsed by SQLGlot and lowered through Polars-SQL. Pick the surface that matches the rest of your code.

bright_sources.py
import acid

acid.init("catalogs.yaml", workers=8)
gaia = acid.open("gaia_dr3")

bright = (gaia
          .where("phot_g_mean_mag < 16")
          .select("source_id, ra, dec, phot_g_mean_mag"))

df = bright.to_polars()
import acid

acid.init("catalogs.yaml", workers=8)
r = acid.sql.query("""
    SELECT source_id, ra, dec, phot_g_mean_mag
    FROM   gaia_dr3
    WHERE  phot_g_mean_mag < 16
""")
df = r.to_polars()

Catalog.where(...) is sticky on the catalog — calling it twice composes the predicates with AND. The two snippets below produce the same plan:

gaia.where("phot_g_mean_mag < 16").where("phot_g_mean_mag > 12")
gaia.where("phot_g_mean_mag < 16 AND phot_g_mean_mag > 12")

If you want OR, write it inline:

gaia.where("phot_g_mean_mag < 12 OR phot_g_mean_mag > 18")

What the predicate looks like

The fluent surface works with flat column names — the ones cat.columns reports — not alias.col:

gaia.where("phot_g_mean_mag < 16")           # OK
gaia.where("gaia.phot_g_mean_mag < 16")      # not how the fluent surface works

The SQL escape hatch (acid.sql.query(...)) still parses real SQL and accepts alias.col:

SELECT * FROM gaia_dr3 AS g WHERE g.phot_g_mean_mag < 16

The expression grammar is the SQL one — comparisons, AND / OR / NOT, parentheses, arithmetic, IS NULL / IS NOT NULL, BETWEEN, IN (...), simple scalar functions. (Inside inline subquery / CTE pre-filters the allowed set is a narrower allowlist; see the SQL escape hatch page for that case.)

Composing predicates

Color cuts and AND-chains

clean = (gaia
         .where("phot_g_mean_mag IS NOT NULL")
         .where("phot_g_mean_mag BETWEEN 14 AND 18")
         .where("bp_rp BETWEEN 0.5 AND 1.5"))

The same as one big predicate; pick whichever reads better:

clean = gaia.where(
    "phot_g_mean_mag IS NOT NULL "
    "AND phot_g_mean_mag BETWEEN 14 AND 18 "
    "AND bp_rp BETWEEN 0.5 AND 1.5"
)

Disjunctions and parentheses

When mixing AND and OR, parenthesize. SQL's precedence puts AND ahead of OR, which is rarely what you want:

gaia.where("(quality_flag = 0 OR quality_flag = 2) AND phot_g_mean_mag < 20")

IN (...) and lists of values

gaia.where("band IN ('g', 'r', 'i')")
gaia.where("source_id IN (4295806720, 4295806721, 4295806722)")

The right-hand side must be a list of literals on the fluent surface. For "is this column in another column", drop into SQL.

NULL handling

NULLs propagate through SQL the usual way: NULL < 18 is NULL, not TRUE or FALSE. Row filters require an explicit IS NULL / IS NOT NULL:

# Sources where photometry exists at all:
have_g = gaia.where("phot_g_mean_mag IS NOT NULL")

# The classic anti-join shape — see crossmatch.md:
unmatched = (a.crossmatch(b, radius=1 * u.arcsec, how="left", dist_col="d")
              .where("d IS NULL"))

d IS NULL after a how="left" crossmatch is how you ask for every anchor without a counterpart. Plain d = NULL is always NULL and filters nothing.

Filtering before vs. after a join

Filters applied before a join run as a pre-filter on the catalog they were attached to:

# `bright_a` is filtered before the matcher sees it.
bright_a = a.where("phot_g_mean_mag < 16")
matches  = bright_a.crossmatch(b, radius=1 * u.arcsec)

This is a real perf win on tight cuts — the matcher only ever sees the bright anchors.

To pre-filter the right operand instead, filter it inside the .crossmatch(...) call (the operand is a real subtree). With maxmatch=1 (nearest) this gives "nearest surviving match" semantics: the matcher only sees the bright right-side sources, so the nearest match is the nearest bright one — not the nearest source, then dropped:

# pre-filter the right side: nearest *bright* counterpart
matches = a.crossmatch(b.where("mag < 20"), radius=1 * u.arcsec)

The same shape on the SQL side is an inline subquery in the FROM (or JOIN-RHS) position:

SELECT a.id, b.id AS b_id
FROM   (SELECT * FROM a WHERE phot_g_mean_mag < 16) AS a
JOIN   b ON XMATCH(radius_arcsec => 1.0)

Filters applied after a join are post-filters on the matched frame and reference the merged-column names (d, mag_b):

near_bright = (a
               .crossmatch(b, radius=5 * u.arcsec, dist_col="d")
               .where("d < 1.0")          # tighter than the search radius
               .where("mag_b < 20"))      # b's columns named after suffixing

For the inline-subquery pre-filter shape (CTEs / FROM (SELECT ...)), see the SQL escape hatch; the inner-WHERE grammar is restricted there.

Pushdown — what acid actually reads from disk

ACID is built on Polars and reads parquet through scan_parquet. That gives you two pushdowns for free:

  • Column pruning. Only the columns you reference (in select, where, aggregates, sort keys) are read from disk. A where on a single column does not pull the other 149 columns of a wide catalog.
  • Row-group statistics. Range predicates (<, <=, >=, >, BETWEEN) on a column with per-row-group min/max statistics let Polars skip whole row groups whose range does not overlap the predicate. This is dramatic on sorted or near-sorted columns.

The most important case is _healpix_29, the integer column HATS catalogs carry that encodes each row's HEALPix nested index at order 29. Spatial filters (cone restriction, MOC filtering — see Sky regions & footprints) get rewritten internally into integer range predicates on _healpix_29, which prunes whole row groups before Polars decompresses a single page. It is the trick that makes per-tuple cone filtering fast on billion-row catalogs.

You do not write _healpix_29 predicates by hand. ACID emits them for you whenever you use a spatial filter — the column is mentioned here only so you know why those spatial filters fly.

Push filters down by writing them first

Polars' query optimizer reorders where / select, so the order you write them in usually doesn't matter for correctness. But .where(...) before .crossmatch(...) on the catalog being matched is meaningfully different from a post-join filter — see the section above.

What doesn't work

  • LIKE / regex predicates inside an inline subquery pre-filter (the FROM (SELECT ... WHERE ...) form). Top-level WHERE / fluent .where accept the full Polars-SQL expression grammar.
  • Subqueries inside WHERE. The supported subquery shape is (SELECT * FROM <registered_catalog> WHERE <row-level predicate>) in FROM / JOIN position — see the SQL escape hatch.
  • WHERE predicates on aggregate outputs — use HAVING (or a fluent post-aggregate .where(...)); see aggregation.

See also