It seems the default location for streaming temp files is "tmp/polars", and I keep running out of disk space when I do a streaming .collect().
However, my AWS SageMaker instance has 1TB of disk space on another drive. How do I make Polars write its temporary data to a specific location?
I don't think the POLARS_* variables are currently documented.
The variable is POLARS_TEMP_DIR. The Polars source reads it like this:
let tmp = std::env::var("POLARS_TEMP_DIR")
Example:
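import os

# The variables are set before polars is imported in this example.
os.environ["POLARS_TEMP_DIR"] = "./foobar/"  # directory for the temporary spill files
os.environ["POLARS_FORCE_OOC"] = "1"  # force the out-of-core (OOC) sort
os.environ["POLARS_VERBOSE"] = "1"  # print the log lines shown under Output

import polars as pl

(
    pl.LazyFrame({"x": range(1_000_000)})
    .sort(pl.all())
    .collect(streaming=True)
)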
Output:
OOC sort forced
OOC sort started
Temporary directory path in use: ./foobar/
RUN STREAMING PIPELINE
[df -> sort -> ordered_sink]
finished sinking into OOC sort in 2.036781ms
full file dump of OOC sort took 38.563679ms
spill size: 64 mb
processing 2 files
partitioning sort took: 32.574176ms
started sort source phase
sort source phase took: 11.992426ms
full ooc sort took: 83.922014ms
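To apply this to the original question (spilling onto the larger drive), something like the following sketch may help. It only uses the Python standard library: it creates the target directory, checks how much free space the drive has, and then sets POLARS_TEMP_DIR. The helper name configure_polars_temp_dir, the min_free_gb parameter, and the path in the usage comment are hypothetical, and the sketch assumes the variable is set before the streaming query runs, as in the example above.

```python
import os
import shutil
from pathlib import Path


def configure_polars_temp_dir(path: str, min_free_gb: float = 0.0) -> int:
    """Point POLARS_TEMP_DIR at `path` and return the free bytes on that drive.

    Hypothetical helper, not part of Polars.
    """
    temp_dir = Path(path).expanduser().resolve()
    # Create the directory (and any parents) if it does not exist yet.
    temp_dir.mkdir(parents=True, exist_ok=True)

    # Check the free space on the drive that holds the directory.
    free_bytes = shutil.disk_usage(temp_dir).free
    if free_bytes < min_free_gb * 1024**3:
        raise OSError(
            f"only {free_bytes / 1024**3:.1f} GiB free at {temp_dir}"
        )

    # Set the variable that the answer above identifies as the temp location.
    os.environ["POLARS_TEMP_DIR"] = str(temp_dir)
    return free_bytes


# Usage (hypothetical mount point; replace with the path of the large drive):
# configure_polars_temp_dir("/mnt/big-drive/polars-tmp", min_free_gb=100)
```

With POLARS_VERBOSE set to "1", the "Temporary directory path in use" line in the log should show whether the new location was picked up.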