Deep Water Soloing at Tonsai Bay

I'm Mike Czech, a software engineer and data scientist living in Hamburg, Germany. I work on autonomous driving safety at MOIA and write about data, machine learning, software engineering, travel, and other things I’m learning.
Recently, I’ve been working more with dbt again and came across a useful way to handle questionable rows: configure a test to warn and store its failures.
{{ config(
severity = 'warn',
store_failures = true
) }}
select *
from {{ ref('some_model') }}
where ...
With dbt test, this writes the failing rows to a table instead of stopping the whole pipeline. That makes it a handy way to flag invalid or suspicious records and keep them available for investigation.
I’m currently working with large Polars dataframes (20M+ rows) and have noticed that using an inner join to filter on a categorical column can be significantly slower than using is_in. There’s a good example on GitHub that demonstrates this issue:
In [156]: N = 10
In [157]: df = pl.DataFrame({"x": pl.Series(range(N))})
In [158]: %timeit -n10 -r10 df.filter(pl.col("x").is_in(df.select("x").to_series() + 1))
120 µs ± 12.3 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)
In [159]: %timeit -n10 -r10 df.join(df.select("x") + 1, on="x")
865 µs ± 101 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)
The somewhat obvious lesson here is to avoid using significantly more expensive operations when a much simpler and more appropriate alternative is available.