 

Can I process a DataFrame using Polars without constructing the entire output in memory?

To load a large dataset into Polars efficiently, one can use the lazy API and the scan_* functions. This works well when we are performing an aggregation (so we have a big input dataset but a small result). However, if I want to process a big dataset in its entirety (for example, change a value in each row of a column), it seems there is no way around using collect and loading the whole (result) dataset into memory.
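For concreteness, a minimal sketch of both cases (file and column names are made up):

```python
import polars as pl

# Big input, small result: the aggregation collapses the data,
# so collect() only materializes a few rows.
small_result = (
    pl.scan_csv("big.csv")
    .group_by("category")
    .agg(pl.col("value").sum())
    .collect()
)

# Full-table transformation: the output is as large as the input,
# so collect() materializes the entire result in memory.
big_result = (
    pl.scan_csv("big.csv")
    .with_columns(pl.col("value") * 2)
    .collect()
)
```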

Is it instead possible to write a LazyFrame to disk directly, and have the processing operate on chunks of the dataset sequentially, in order to limit memory usage?

asked Dec 08 '25 19:12 by nardi


1 Answer

Edit (2023-01-08)

Polars has growing support for streaming/out-of-core processing.

To run a query in streaming mode, collect your LazyFrame with collect(streaming=True).
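A minimal sketch (hypothetical file and column names):

```python
import polars as pl

# Execute the lazy plan with the streaming engine, processing the
# input in batches instead of loading it all at once.
result = (
    pl.scan_csv("big.csv")
    .with_columns((pl.col("value") * 2).alias("value"))
    .collect(streaming=True)
)
```

Note that the final result still ends up in memory here; streaming limits the memory used while *executing* the query.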

If the result does not fit into memory, try to sink it to disk with sink_parquet.
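With sink_parquet the result never has to fit in memory at all, since batches are written to disk as they are produced. A sketch with the same hypothetical names:

```python
import polars as pl

# Stream the transformed data straight to a Parquet file on disk.
(
    pl.scan_csv("big.csv")
    .with_columns((pl.col("value") * 2).alias("value"))
    .sink_parquet("big_out.parquet")
)
```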

Old answer (no longer accurate):

Polars' algorithms are not streaming, so they need all data in memory for operations like join, groupby, aggregations, etc. Writing to disk directly would therefore still keep those intermediate DataFrames in memory.

There are of course things you can do. Depending on the type of query, it may lend itself to embarrassing parallelization. A sum, for instance, can easily be computed in chunks.
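One way to sketch a chunked sum, assuming a hypothetical file and column name (LazyFrame.slice keeps memory bounded to one window at a time, though re-scanning the file per chunk costs extra I/O):

```python
import polars as pl

CHUNK_SIZE = 1_000_000  # rows per chunk; a tuning parameter
total = 0
offset = 0
while True:
    chunk = (
        pl.scan_csv("big.csv")
        .slice(offset, CHUNK_SIZE)  # only this row window is materialized
        .select("value")
        .collect()
    )
    if chunk.height == 0:  # no rows left
        break
    total += chunk["value"].sum()
    offset += CHUNK_SIZE

print(total)
```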

You could also process columns in smaller chunks. This still allows you to compute harder aggregations/computations, as sketched below.
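For example, with a columnar format like Parquet you can load one column at a time, so only a single column is ever resident in memory (column names here are hypothetical):

```python
import polars as pl

stats = {}
for name in ["a", "b", "c"]:
    series = (
        pl.scan_parquet("big.parquet")
        .select(name)          # projection pushdown: only this column is read
        .collect()
        .get_column(name)
    )
    stats[name] = series.median()  # a "harder" aggregation than a plain sum
```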

Use lazy

If your query contains many filters and Polars is able to apply them at the scan, your memory pressure is reduced to the selectivity ratio.
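A sketch of this predicate pushdown (file and column names are made up; depending on your Polars version the plan-inspection method may differ):

```python
import polars as pl

lf = (
    pl.scan_parquet("big.parquet")
    .filter(pl.col("country") == "NL")  # pushed down into the scan
    .select(["country", "value"])
)

print(lf.explain())  # the optimized plan shows the filter applied at the scan
df = lf.collect()    # only the matching rows ever enter memory
```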

answered Dec 12 '25 09:12 by ritchie46


