Mike Czech

I'm Mike Czech, a software engineer and data scientist living in Hamburg, Germany. I work on autonomous driving safety at MOIA and write about data, machine learning, software engineering, travel, and other things I’m learning.

I’ve noticed that with AI-assisted coding, it’s becoming even more common to end up with large PRs. This is problematic for several reasons:

  • bugs are harder to spot
  • merge conflicts increase
  • reviews take longer

The interesting thing is that AI coding agents are also really good at splitting large PRs into smaller, more manageable pieces! I’ll often ask an agent to identify independent changes, separate refactors from functional changes, and suggest a sequence of smaller PRs.

To me, this shows that AI-assisted coding isn’t just about producing more code in less time. It can also help reinforce good engineering practices, which are necessary if you want to scale development over time.

One of my favourite new tricks is to be a little more verbose in Slack and then use Claude and Slack MCP to generate a pull request from the discussion. That way, ideas from our Slack discussions make their way into the product almost immediately!

It’s a small example of how software engineering is becoming less about producing code and more about collaboratively shaping the product.

Recently, I’ve been working more with dbt again and came across a useful way to handle questionable rows: configure a test to warn and store its failures.

{{ config(
    severity = 'warn',
    store_failures = true
) }}

select *
from {{ ref('some_model') }}
where ...

With dbt test, this writes the failing rows to a table instead of stopping the whole pipeline. That makes it a handy way to flag invalid or suspicious records and keep them available for investigation.

I’m currently working with large Polars dataframes (20M+ rows) and have noticed that using an inner join to filter on a categorical column can be significantly slower than using is_in. There’s a good example on GitHub that demonstrates this issue:

In [156]: N = 10

In [157]: df = pl.DataFrame({"x": pl.Series(range(N))})

In [158]: %timeit -n10 -r10 df.filter(pl.col("x").is_in(df.select("x").to_series() + 1))
120 µs ± 12.3 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)

In [159]: %timeit -n10 -r10 df.join(df.select("x") + 1, on="x")
865 µs ± 101 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)

The somewhat obvious lesson here is to avoid using significantly more expensive operations when a much simpler and more appropriate alternative is available.