How to append data to an existing parquet file

I'm using the following code to create a ParquetWriter and write records to it.

// The writer always creates a new file at 'path'; there is no constructor for opening an existing file
ParquetWriter<GenericRecord> parquetWriter = new ParquetWriter<>(path, writeSupport, CompressionCodecName.SNAPPY, BLOCK_SIZE, PAGE_SIZE);

final GenericRecord record = new GenericData.Record(avroSchema);
parquetWriter.write(record);

But this only allows creating new files (at the specified path). Is there a way to append data to an existing Parquet file (at path)? Caching the parquetWriter is not feasible in my case.

asked Aug 30 '16 by Krishas
People also ask

Can I append data to Parquet file?

The version of MATLAB's parquetwrite introduced in R2019a does not currently support appending to preexisting Parquet files on disk.

Can you append to a Parquet file Python?

Using the append save mode, you can append a DataFrame to an existing Parquet file; to overwrite it instead, use the overwrite save mode.

Can we merge Parquet files?

The Apache Parquet Merge tool is an interactive, command-line tool that merges multiple Parquet table increment files into a single table increment file containing the merged segments.
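That tool aside, if the goal is simply to combine several small Parquet files that share a schema into one file, a minimal sketch using pyarrow (the file names here are hypothetical) could look like this:

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical input files that all share the same schema
files = ["part-0.parquet", "part-1.parquet", "part-2.parquet"]

# Read each file into a table and concatenate them in memory
merged = pa.concat_tables(pq.read_table(f) for f in files)

# Write the combined data back out as a single file
pq.write_table(merged, "merged.parquet")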


1 Answer

Spark's write API has a SaveMode called append (https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/SaveMode.html), which I believe solves your problem.

Example of use:

df.write.mode('append').parquet('parquet_data_file')
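A slightly fuller PySpark sketch, assuming an existing SparkSession named spark and a hypothetical output path parquet_data_file; note that append adds new part files to the directory rather than modifying the files already there:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-append").getOrCreate()

# First batch: creates the Parquet output directory
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df1.write.mode("overwrite").parquet("parquet_data_file")

# Second batch: written as additional part files in the same directory
df2 = spark.createDataFrame([(3, "c")], ["id", "value"])
df2.write.mode("append").parquet("parquet_data_file")

# Reading the path returns both batches together
spark.read.parquet("parquet_data_file").show()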
answered Oct 07 '22 by bluszcz