To load a large dataset into Polars efficiently, one can use the lazy API and the scan_* functions. This works well when performing an aggregation (a big input dataset but a small result). However, if I want to process a big dataset in its entirety (for example, change a value in each row of a column), it seems there is no way around calling collect and loading the whole (result) dataset into memory.
Is it instead possible to write a LazyFrame to disk directly, and have the processing operate on chunks of the dataset sequentially, in order to limit memory usage?
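To make the question concrete, here is roughly the pattern I mean (the file and column names are just placeholders):

```python
import polars as pl

# The lazy scan keeps memory low while building the query plan,
# but .collect() still materializes the full result in RAM.
lf = (
    pl.scan_csv("data.csv")
      .with_columns(pl.col("value") * 2)  # transform every row of a column
)

df = lf.collect()              # whole result loaded into memory
df.write_parquet("out.parquet")
```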
Polars has growing support for streaming/out-of-core processing.
To run a query with the streaming engine, collect your LazyFrame with collect(streaming=True).
If the result does not fit into memory, try sinking it to disk with sink_parquet.
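A minimal sketch of both options; the file names and the "value" column are placeholders:

```python
import polars as pl

lf = (
    pl.scan_csv("data.csv")
      .with_columns(pl.col("value") * 2)
)

# Execute the query with the streaming engine, which processes the data in batches.
df = lf.collect(streaming=True)

# Or, if even the result is too large for memory, stream it straight to disk.
lf.sink_parquet("out.parquet")
```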
Polars' algorithms are not streaming, so they need all the data in memory for operations like join, groupby, aggregations, etc. Writing to disk directly would therefore still keep those intermediate DataFrames in memory.
There are, of course, things you can do. Depending on the type of query, it may lend itself to embarrassing parallelization. A sum, for instance, can easily be computed in chunks, as in the sketch below.
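A hypothetical sketch, assuming the dataset is already split across several parquet files: each chunk's partial sum is computed independently and the partial results are combined at the end.

```python
import polars as pl

# Placeholder file names; in practice these would be the chunks of your dataset.
files = ["part_0.parquet", "part_1.parquet", "part_2.parquet"]

# Each partial sum only needs one chunk in memory at a time.
partial_sums = [
    pl.scan_parquet(f).select(pl.col("value").sum()).collect().item()
    for f in files
]

total = sum(partial_sums)
print(total)
```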
You could also process columns in smaller chunks, which still lets you compute harder aggregations and computations.
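For example, a wide dataset could be processed a few columns at a time; the file name, the column grouping, and the rank computation here are assumptions for illustration:

```python
import polars as pl

# Hypothetical column groups of a wide dataset.
column_groups = [["a", "b"], ["c", "d"]]

for i, cols in enumerate(column_groups):
    (
        pl.scan_parquet("wide_data.parquet")
          .select(cols)                    # projection pushdown: only these columns are read
          .with_columns(pl.all().rank())   # an example of a heavier per-column computation
          .collect()
          .write_parquet(f"result_cols_{i}.parquet")
    )
```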
If you have many filters in your query and Polars is able to apply them at the scan, your memory pressure is reduced to the selectivity ratio.
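A sketch of what that looks like, with placeholder file and column names: because the filters are pushed down into the scan, only the matching rows ever enter memory.

```python
import polars as pl

lf = (
    pl.scan_parquet("events.parquet")
      .filter(pl.col("country") == "NL")
      .filter(pl.col("amount") > 100)
)

# The optimized plan should show the predicates applied at the scan node.
print(lf.explain())

df = lf.collect()
```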