Is it possible to read Parquet files from one folder in S3 and write them to another, using pyarrow alone without converting to pandas?
Here is my code:
import pyarrow.parquet as pq
import pyarrow as pa
import s3fs
s3 = s3fs.S3FileSystem()
bucket = 'demo-s3'
# note: read(nthreads=...) is deprecated in newer pyarrow; use use_threads=True
df = pq.ParquetDataset('s3://{0}/old'.format(bucket), filesystem=s3).read(use_threads=True).to_pandas()
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, 's3://{0}/new'.format(bucket), filesystem=s3, use_dictionary=True, compression='snappy')
If you do not want to copy the files byte-for-byte, you can indeed skip pandas entirely: ParquetDataset.read already returns a pyarrow.Table, which write_to_dataset accepts directly.
table = pq.ParquetDataset('s3://{0}/old'.format(bucket),
                          filesystem=s3).read(use_threads=True)
pq.write_to_dataset(table, 's3://{0}/new'.format(bucket),
                    filesystem=s3, use_dictionary=True, compression='snappy')