A classic performance optimization when dealing with larger amounts of data relates to typical usage patterns: just as we often require only a subset of columns, we often want to process only a subset of rows, for example all rows from the last three months. This is where table partitioning helps, dividing the data into more manageable blocks. The result is faster (and likely cheaper) queries, provided the queries filter on the partitioning scheme. Note that your data processing engine and storage layer determine how partitions are actually implemented. For example, Hive-style partitioning separates data into folders:
purchases/date=20240801/
    fileA.parquet
    fileB.parquet
    fileC.parquet
purchases/date=20240802/
    fileA.parquet
    fileB.parquet
    fileC.parquet
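As a minimal sketch of how such a layout can be produced, the following writes a Parquet dataset partitioned by a date column using pyarrow (the table contents are made up for illustration):

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical purchases data; in practice this comes from your pipeline.
table = pa.table({
    "date": ["20240801", "20240801", "20240802"],
    "customer_id": [1, 2, 1],
    "amount": [9.99, 24.50, 5.00],
})

# write_to_dataset creates one folder per distinct value of `date`,
# i.e. purchases/date=20240801/, purchases/date=20240802/, ...
pq.write_to_dataset(table, root_path="purchases", partition_cols=["date"])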
Suppose you now run a query like SELECT count(*) FROM purchases WHERE date='20240802'. The query engine then only needs to read the files under purchases/date=20240802/, skipping all the other data files.
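As a small sketch of such a pruned read, here is the same query with DuckDB from Python; hive_partitioning = true derives the date column from the folder names:

import duckdb

con = duckdb.connect()
# The WHERE clause on the partition column prunes all other date folders.
count = con.execute("""
    SELECT count(*)
    FROM read_parquet('purchases/*/*.parquet', hive_partitioning = true)
    WHERE date = '20240802'
""").fetchone()[0]
print(count)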
Hive-style partitioning works well, but it also comes with limitations: it can exacerbate the small file problem, it adds processing overhead (file listings), and it introduces operational complexity (e.g., when altering partitions). In contrast, modern table formats like Apache Iceberg and Delta Lake provide more sophisticated partitioning techniques that mitigate some of these drawbacks. Beyond the performance aspects, the concept of immutable partitioning itself leads to interesting opportunities for building robust and reproducible data pipelines.
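As a small sketch of one such format, the deltalake Python package can write the same data as a partitioned Delta table; its transaction log then tracks which files belong to which partition, so readers no longer depend on listing folders (table contents again made up for illustration):

import pyarrow as pa
from deltalake import write_deltalake

table = pa.table({
    "date": ["20240801", "20240802"],
    "customer_id": [1, 2],
    "amount": [9.99, 24.50],
})

# The Delta transaction log records which files belong to which partition,
# replacing the per-query file listings of plain Hive-style layouts.
write_deltalake("purchases_delta", table, partition_by=["date"])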
In addition to table partitioning, table clustering (sometimes called bucketing) sorts related rows (according to a given set of columns) within a partition, aligning the physical order with common query patterns and further enhancing performance. It’s often advisable to combine partitioning and clustering for maximum performance, though the former is certainly the more critical technique for workloads that process larger amounts of data.
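As a rough sketch with PySpark (the customer_id column and the input path are assumed here for illustration), partitioning by date and sorting rows within each partition combines both techniques:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("purchases_raw")  # hypothetical unpartitioned input

# One in-memory partition per date, rows sorted by customer_id within each
# partition, then written out as one folder per date (Hive-style).
(df.repartition("date")
   .sortWithinPartitions("customer_id")
   .write.partitionBy("date")
   .parquet("purchases"))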