Pyarrow read/write from s3

Question

Is it possible to read Parquet files from one folder in S3 and write them to another folder using pyarrow, without converting to pandas?

Here is my code:

import pyarrow.parquet as pq
import pyarrow as pa
import s3fs

s3 = s3fs.S3FileSystem()

bucket = 'demo-s3'

# Read the whole dataset, convert it to pandas, then back to an Arrow table
df = pq.ParquetDataset('s3://{0}/old'.format(bucket), filesystem=s3).read(nthreads=4).to_pandas()
table = pa.Table.from_pandas(df)
# Write the table to the new folder
pq.write_to_dataset(table, 's3://{0}/new'.format(bucket), filesystem=s3, use_dictionary=True, compression='snappy')

mdurant · Accepted Answer

If you do not wish to copy the files directly, it appears you can indeed avoid pandas thus:

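# read() already returns a pyarrow.Table, so the pandas round trip is unnecessary.
# Note: pyarrow 0.11 and later replace nthreads=4 with use_threads=True.
table = pq.ParquetDataset('s3://{0}/old'.format(bucket),
    filesystem=s3).read(nthreads=4)
pq.write_to_dataset(table, 's3://{0}/new'.format(bucket),
    filesystem=s3, use_dictionary=True, compression='snappy')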
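
For the other option mentioned above, copying the files directly, here is a minimal sketch. It is not part of the original answer: the helper name copy_parquet_files is hypothetical, and it assumes the filesystem object exposes fsspec-style find() and copy() methods (as s3fs.S3FileSystem does) and that the data files end in ".parquet". Nothing is decoded, so the files arrive byte-for-byte unchanged, which also means compression and encoding settings are not rewritten.

```python
import posixpath


def _strip_scheme(path):
    # find() on fsspec-style filesystems returns "bucket/key" paths, so
    # normalise "s3://bucket/key" input to the same form before comparing.
    if path.startswith("s3://"):
        path = path[len("s3://"):]
    return path.rstrip("/")


def copy_parquet_files(fs, src_prefix, dst_prefix):
    """Copy every *.parquet file under src_prefix to dst_prefix.

    Hypothetical helper: `fs` is assumed to provide find(path) and
    copy(src, dst). Sub-folders (e.g. partition directories) are kept.
    Returns the list of destination paths that were written.
    """
    src_prefix = _strip_scheme(src_prefix)
    dst_prefix = _strip_scheme(dst_prefix)
    copied = []
    for src in fs.find(src_prefix):
        if not src.endswith(".parquet"):
            continue  # skip marker or metadata files such as _SUCCESS
        relative = src[len(src_prefix):].lstrip("/")
        dst = posixpath.join(dst_prefix, relative)
        fs.copy(src, dst)
        copied.append(dst)
    return copied
```

With the variables from the question, this would be called as copy_parquet_files(s3, 's3://{0}/old'.format(bucket), 's3://{0}/new'.format(bucket)). If the goal is to change compression or dictionary encoding, the pyarrow read/write approach above is the one to use instead.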

Tags:

python

pyarrow

thotam
