I'm trying to overwrite my parquet files in S3 with pyarrow. I've looked through the documentation and haven't found anything.
Here is my code:
from s3fs.core import S3FileSystem
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
s3 = S3FileSystem(anon=False)
output_dir = "s3://mybucket/output/my_table"
my_csv = pd.read_csv("file.csv")
my_table = pa.Table.from_pandas(my_csv, preserve_index=False)
pq.write_to_dataset(my_table,
                    output_dir,
                    filesystem=s3,
                    use_dictionary=True,
                    compression="snappy")
Is there something like a mode="overwrite" option in the write_to_dataset function?
I think the best way to do it is with AWS Data Wrangler, which offers three different write modes: append, overwrite, and overwrite_partitions.
Example:
import awswrangler as wr
wr.s3.to_parquet(
    df=df,
    path="s3://...",
    mode="overwrite",
    dataset=True,
    database="my_database",  # optional, only if you want the table available in the Athena/Glue Catalog
    table="my_table",
    partition_cols=["PARTITION_COL_NAME"],
)
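If you would rather stay with plain pyarrow and s3fs, write_to_dataset has no mode="overwrite" flag in the setup shown above (newer pyarrow releases do add an existing_data_behavior argument), but you can get the same effect by deleting everything under the target prefix before writing. A minimal sketch, reusing the bucket and CSV from the question:
from s3fs.core import S3FileSystem
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
s3 = S3FileSystem(anon=False)
output_dir = "s3://mybucket/output/my_table"
# "overwrite" by hand: remove any existing objects under the prefix first
if s3.exists(output_dir):
    s3.rm(output_dir, recursive=True)
my_csv = pd.read_csv("file.csv")
my_table = pa.Table.from_pandas(my_csv, preserve_index=False)
pq.write_to_dataset(my_table,
                    output_dir,
                    filesystem=s3,
                    use_dictionary=True,
                    compression="snappy")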