I'd like to read a partitioned parquet file into a polars dataframe.
In Spark, it is simple:
df = spark.read.parquet("/my/path")
The polars documentation says that it should work the same way:
df = pl.read_parquet("/my/path")
But it gives me the error:
raise IsADirectoryError(f"Expected a file path; {path!r} is a directory")
How do I read this partitioned dataset?
As an example using S3 (since you say your files are cloud-hosted), you first establish a filesystem connection (via fsspec), create a dataset against it (as suggested by Dean), and then read into polars from that:
from pyarrow.dataset import dataset
from s3fs import S3FileSystem

import polars as pl

# setup cloud filesystem access
cloudfs = S3FileSystem( ... )

# reference multiple parquet files as a single dataset
pyarrow_dataset = dataset(
    source="s3://bucket/path/*.parquet",
    filesystem=cloudfs,
    format="parquet",
)

# load lazily (and efficiently) into polars
ldf = pl.scan_pyarrow_dataset(pyarrow_dataset)
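Since scan_pyarrow_dataset returns a LazyFrame, you can push filters and column selections down before materializing, so only the matching files/row groups are actually read. A minimal sketch (the column names "year" and "value" are just placeholders for your own schema):

# only the relevant data is fetched when .collect() runs
df = (
    ldf
    .filter(pl.col("year") == 2023)   # hypothetical partition column
    .select(["year", "value"])        # hypothetical columns
    .collect()
)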
Here's a snippet of the source code:
if isinstance(source, str) and "*" in source and _is_local_file(source):
    from polars import scan_parquet

    scan = scan_parquet(
        source,
        n_rows=n_rows,
        rechunk=True,
        parallel=parallel,
        row_count_name=row_count_name,
        row_count_offset=row_count_offset,
        low_memory=low_memory,
    )
The important bit is that it's looking for an * in the source path.
So it seems you just need to do
df = pl.read_parquet("/my/path/*")
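If the partitions live in nested subdirectories (e.g. hive-style folders such as /my/path/year=2023/...), a recursive glob should pick them all up. This is a sketch that assumes your polars version supports ** in glob patterns:

# read every parquet file under the partition folders
df = pl.read_parquet("/my/path/**/*.parquet")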
This only works on local filesystems, so if you're reading from cloud storage you'd have to use pyarrow datasets to read multiple files at once without iterating over them yourself.
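For a hive-partitioned directory you can also point a pyarrow dataset at the directory itself and let it discover the partition columns. A minimal sketch, assuming hive-style folder names like year=2023 (swap in a filesystem argument as in the S3 example if the path is remote):

from pyarrow.dataset import dataset
import polars as pl

# point at the directory; partition folders become columns
pyarrow_dataset = dataset(
    "/my/path",
    format="parquet",
    partitioning="hive",
)

ldf = pl.scan_pyarrow_dataset(pyarrow_dataset)
df = ldf.collect()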