To load a large dataset into Polars efficiently, one can use the lazy API and the scan_* functions. This works well when performing an aggregation (a big input dataset but a small result). However, if I want to process a big dataset in its entirety (for example, change a value in each row of a column), it seems there is no way around calling collect and loading the whole (result) dataset into memory.
Is it instead possible to write a LazyFrame to disk directly, and have the processing operate on chunks of the dataset sequentially, in order to limit memory usage?
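To make the question concrete, here is roughly the pattern I mean (the file and column names are just placeholders):

```python
import polars as pl

# The lazy scan keeps memory low while building the query plan,
# but .collect() still materializes the full result in RAM.
lf = (
    pl.scan_csv("data.csv")
      .with_columns(pl.col("value") * 2)  # transform every row of a column
)

df = lf.collect()              # whole result loaded into memory
df.write_parquet("out.parquet")
```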
Polars has growing support for streaming/out-of-core processing.
To run a query with the streaming engine, collect your LazyFrame with collect(streaming=True).
If the result does not fit into memory, try sinking it to disk with sink_parquet.
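A minimal sketch of both options; the file names and the "value" column are placeholders:

```python
import polars as pl

lf = (
    pl.scan_csv("data.csv")
      .with_columns(pl.col("value") * 2)
)

# Execute the query with the streaming engine, which processes the data in batches.
df = lf.collect(streaming=True)

# Or, if even the result is too large for memory, stream it straight to disk.
lf.sink_parquet("out.parquet")
```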
Polars' algorithms are not streaming, so they need all the data in memory for operations like join, groupby, aggregations, etc. Writing to disk directly would therefore still keep those intermediate DataFrames in memory.
There are, of course, things you can do. Depending on the type of query, it may lend itself to embarrassing parallelization. A sum, for instance, can easily be computed in chunks, as in the sketch below.
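A hypothetical sketch, assuming the dataset is already split across several parquet files: each chunk's partial sum is computed independently and the partial results are combined at the end.

```python
import polars as pl

# Placeholder file names; in practice these would be the chunks of your dataset.
files = ["part_0.parquet", "part_1.parquet", "part_2.parquet"]

# Each partial sum only needs one chunk in memory at a time.
partial_sums = [
    pl.scan_parquet(f).select(pl.col("value").sum()).collect().item()
    for f in files
]

total = sum(partial_sums)
print(total)
```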
You could also process columns in smaller chunks, which still lets you compute harder aggregations and computations.
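For example, a wide dataset could be processed a few columns at a time; the file name, the column grouping, and the rank computation here are assumptions for illustration:

```python
import polars as pl

# Hypothetical column groups of a wide dataset.
column_groups = [["a", "b"], ["c", "d"]]

for i, cols in enumerate(column_groups):
    (
        pl.scan_parquet("wide_data.parquet")
          .select(cols)                    # projection pushdown: only these columns are read
          .with_columns(pl.all().rank())   # an example of a heavier per-column computation
          .collect()
          .write_parquet(f"result_cols_{i}.parquet")
    )
```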
If you have many filters in your query and Polars is able to apply them at the scan, your memory pressure is reduced to the selectivity ratio.
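A sketch of what that looks like, with placeholder file and column names: because the filters are pushed down into the scan, only the matching rows ever enter memory.

```python
import polars as pl

lf = (
    pl.scan_parquet("events.parquet")
      .filter(pl.col("country") == "NL")
      .filter(pl.col("amount") > 100)
)

# The optimized plan should show the predicates applied at the scan node.
print(lf.explain())

df = lf.collect()
```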