
Streaming parquet files from S3 (Python)

I should begin by saying that this is not running in Spark.

What I am attempting to do is

  • stream n records from a parquet file in S3
  • process
  • stream back to a different file in S3

...but am only inquiring about the first step.

I have tried various things like:

from pyarrow import fs
from pyarrow.parquet import ParquetFile

s3 = fs.S3FileSystem(access_key=aws_key, secret_key=aws_secret)
with s3.open_input_stream(filepath) as f:
    print(type(f))  # pyarrow.lib.NativeFile
    parquet_file = ParquetFile(f)
    for batch in parquet_file.iter_batches():  # .read_row_groups() would be better
        pass  # process each batch here

...but I am getting OSError: only valid on seekable files, and I'm not sure how to get around it.

Apologies if this is a duplicate. I searched but didn't find quite the fit I was looking for.

asked Oct 21 '25 by Damian Satterthwaite-Phillips

1 Answer

Try using open_input_file, which will 'Open an input file for random access reading', instead of open_input_stream, which will 'Open an input stream for sequential reading'.

For context: a Parquet file stores its metadata in a footer at the end of the file, so the reader needs to be able to seek back and forth rather than read purely sequentially.
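As a sketch of the fix (reusing the aws_key, aws_secret, and filepath placeholders from the question, plus a hypothetical process() function for the per-batch work):

from pyarrow import fs
from pyarrow.parquet import ParquetFile

s3 = fs.S3FileSystem(access_key=aws_key, secret_key=aws_secret)

# open_input_file returns a seekable file, so ParquetFile can read the
# footer metadata at the end before streaming the data pages
with s3.open_input_file(filepath) as f:
    parquet_file = ParquetFile(f)
    for batch in parquet_file.iter_batches(batch_size=1000):
        # each batch is a pyarrow.RecordBatch of up to batch_size rows
        process(batch)

If you prefer row-group granularity over fixed-size batches, ParquetFile.read_row_group(i) works the same way once the file has been opened for random access.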

answered Oct 23 '25 by 0x26res

