
How do I change the temp file location of the Polars streaming API?

It seems the default temp file location for streaming is "tmp/polars", and I keep running out of disk space when I do a streaming .collect().

However, my AWS SageMaker instance has 1TB of disk space on another drive. How do I make Polars write its temp files to a specific location?

Kyle Gilde asked Sep 13 '25 06:09



1 Answer

I don't think the POLARS_* environment variables are currently documented.

The one you want is POLARS_TEMP_DIR. The Polars source reads it from the environment:

let tmp = std::env::var("POLARS_TEMP_DIR")

Example:

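import os

# Set these before importing polars so they are in place when Polars reads them.
os.environ["POLARS_TEMP_DIR"] = "./foobar/"  # where spill files are written
os.environ["POLARS_FORCE_OOC"] = "1"         # force the out-of-core (OOC) sort
os.environ["POLARS_VERBOSE"] = "1"           # print the log shown under "Output"

import polars as pl

(
    pl.LazyFrame({"x": range(1_000_000)})
    .sort(pl.all())
    .collect(streaming=True)
)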

Output:

OOC sort forced
OOC sort started
Temporary directory path in use: ./foobar/
RUN STREAMING PIPELINE
[df -> sort -> ordered_sink]
finished sinking into OOC sort in 2.036781ms
full file dump of OOC sort took 38.563679ms
spill size: 64 mb
processing 2 files
partitioning sort took: 32.574176ms
started sort source phase
sort source phase took: 11.992426ms
full ooc sort took: 83.922014ms
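
For the case in the question (a bigger drive mounted elsewhere), the same idea can be wrapped in a small helper. This is only a sketch: `use_polars_temp_dir` is a hypothetical name, not part of Polars, and `tempfile.mkdtemp()` stands in for the real mount point of the large drive. Creating the directory up front is my own precaution; I have not checked whether Polars creates it for you.

```python
import os
import tempfile
from pathlib import Path


def use_polars_temp_dir(path):
    """Hypothetical helper: create `path` if needed and point POLARS_TEMP_DIR at it."""
    temp_dir = Path(path).expanduser().resolve()
    # Assumption: make sure the directory exists before Polars tries to spill into it.
    temp_dir.mkdir(parents=True, exist_ok=True)
    os.environ["POLARS_TEMP_DIR"] = str(temp_dir)
    return temp_dir


# Stand-in for the large drive; replace with the real mount point on your instance.
big_drive = tempfile.mkdtemp()
spill_dir = use_polars_temp_dir(Path(big_drive) / "polars-tmp")
print(spill_dir)
```

Call the helper before `import polars`, as in the example above, then run the streaming `.collect()` as usual. With POLARS_VERBOSE set, the "Temporary directory path in use" line in the log should confirm which path was picked up.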
jqurious answered Sep 15 '25 19:09
