I should begin by saying that this is not running in Spark.
What I am attempting to do is read a Parquet file from S3 with pyarrow and process it in batches rather than loading the whole file at once. I have tried various things like:
from pyarrow import fs
from pyarrow.parquet import ParquetFile

s3 = fs.S3FileSystem(access_key=aws_key, secret_key=aws_secret)

with s3.open_input_stream(filepath) as f:
    print(type(f))  # pyarrow.lib.NativeFile
    parquet_file = ParquetFile(f)
    for batch in parquet_file.iter_batches():  # .read_row_groups() would be better
        # process
        pass
...but I keep getting OSError: only valid on seekable files, and I'm not sure how to get around it.
Apologies if this is a duplicate. I searched but didn't find quite the fit I was looking for.
Try using open_input_file, which "open[s] an input file for random access reading", instead of open_input_stream, which "open[s] an input stream for sequential reading".
For context, in a Parquet file the metadata is stored at the end, so the reader needs to be able to seek back and forth in the file; a purely sequential stream can't do that.
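Here is a minimal sketch of that change, reusing the aws_key, aws_secret, and filepath names from the question (adjust them to your setup):

from pyarrow import fs
from pyarrow.parquet import ParquetFile

s3 = fs.S3FileSystem(access_key=aws_key, secret_key=aws_secret)

# open_input_file returns a seekable, random-access handle, so
# ParquetFile can jump to the footer metadata at the end of the file.
with s3.open_input_file(filepath) as f:
    parquet_file = ParquetFile(f)

    # stream record batches instead of loading the whole file
    for batch in parquet_file.iter_batches(batch_size=65536):
        print(batch.num_rows)

    # or read one row group at a time
    for i in range(parquet_file.num_row_groups):
        table = parquet_file.read_row_group(i)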