If you encounter slow SQL queries, or slow data processing in general, it might indicate a problem with how your data is stored. Even if the delay is only a few minutes, the iterative and exploratory nature of working with data means such delays can cripple productivity (and sanity). I’ve often observed that people don’t question this lack of performance because they assume that processing larger amounts of data inevitably takes time. While that can indeed be the case, it’s not always true!
There are various ways to store data, and each comes with its own trade-offs in terms of performance. With query engines like Trino (distributed) or DuckDB (local), the initial choice usually boils down to either a row-based format (e.g., Avro) or a column-based one (e.g., Parquet). Note that this most likely does not apply to cases where a data warehouse solution like BigQuery is used, as such systems automatically convert the ingested data to a suitable format.
In most ML settings, a column-based format is usually preferable because we deal with many columns but only need a subset of them at a time. Columnar formats allow query engines to read just the required columns, reducing I/O and speeding up queries. More generally, they enable predicate pushdown, allowing filters to be applied at the storage level (see, for example, the Athena user guide). For query engines like Athena or BigQuery, this can also decrease cloud costs, as billing is based on the amount of data processed per query. Furthermore, columnar formats support efficient compression algorithms that reduce data size without significantly impacting decompression speed. Smaller data sizes mean less data to read from disk, which directly improves query performance.
In contrast, row-based formats are more suitable for OLTP (Online Transaction Processing) and streaming use cases, such as when processing and writing data with Apache Flink. If you are ingesting data via streaming, it might be preferable to initially use a row-based format for efficient data processing. However, it’s often advantageous to later convert this data into a columnar format that is more suitable for data analysis.