Notes

Short thoughts, moments, and learning notes.

Recently, I’ve been working more with dbt again and came across a useful way to handle questionable rows: configure a test to warn and store its failures.

{{ config(
    severity = 'warn',
    store_failures = true
) }}

select *
from {{ ref('some_model') }}
where ...

With dbt test, this writes the failing rows to a table instead of stopping the whole pipeline. That makes it a handy way to flag invalid or suspicious records and keep them available for investigation.

I’m currently working with large Polars dataframes (20M+ rows) and have noticed that using an inner join to filter on a categorical column can be significantly slower than using is_in. There’s a good example on GitHub that demonstrates this issue:

In [156]: N = 10

In [157]: df = pl.DataFrame({"x": pl.Series(range(N))})

In [158]: %timeit -n10 -r10 df.filter(pl.col("x").is_in(df.select("x").to_series() + 1))
120 µs ± 12.3 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)

In [159]: %timeit -n10 -r10 df.join(df.select("x") + 1, on="x")
865 µs ± 101 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)

The somewhat obvious lesson here is to avoid using significantly more expensive operations when a much simpler and more appropriate alternative is available.