I'm using the following code to create a ParquetWriter and to write records to it:
ParquetWriter<GenericRecord> parquetWriter = new ParquetWriter<>(path, writeSupport, CompressionCodecName.SNAPPY, BLOCK_SIZE, PAGE_SIZE);
final GenericRecord record = new GenericData.Record(avroSchema);
parquetWriter.write(record);
But this only allows creating new files (at the specified path). Is there a way to append data to an existing Parquet file (at path)? Caching the ParquetWriter is not feasible in my case.
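Since parquet-mr's ParquetWriter cannot open an existing file for appending, a common workaround is to write each batch to a new part file in the same directory and treat the directory as one dataset, which is how downstream readers such as Spark and Hive already consume Parquet. Below is a minimal sketch along those lines; writeBatch, dataDir, and the part-file naming scheme are hypothetical, and it uses the Avro bindings (AvroParquetWriter) rather than the raw WriteSupport above.

import java.io.IOException;
import java.util.List;
import java.util.UUID;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class PartFileWriter {

    // Hypothetical helper: writes one batch to a fresh part file under dataDir.
    public static void writeBatch(String dataDir, Schema avroSchema,
                                  List<GenericRecord> batch) throws IOException {
        // A unique name per batch avoids clobbering files written earlier.
        Path path = new Path(dataDir, "part-" + UUID.randomUUID() + ".parquet");
        try (ParquetWriter<GenericRecord> writer =
                     AvroParquetWriter.<GenericRecord>builder(path)
                             .withSchema(avroSchema)
                             .withCompressionCodec(CompressionCodecName.SNAPPY)
                             .build()) {
            for (GenericRecord record : batch) {
                writer.write(record);
            }
        }
    }
}

Each call produces a small immutable file, so no writer needs to be cached between batches.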
MATLAB's parquetwrite function, introduced in R2019a, does not currently support appending to preexisting Parquet files on disk.
Append or overwrite an existing Parquet file: using the append save mode, you can append a DataFrame to an existing Parquet file. To overwrite it instead, use the overwrite save mode.
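For reference, here is a minimal sketch of both save modes using Spark's Java API; the save helper and the Dataset<Row> argument are assumptions, and the PySpark equivalent appears further below.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class SaveModes {
    // Hypothetical helper operating on an already-loaded Dataset<Row>.
    static void save(Dataset<Row> df) {
        // Append adds new part files under the target path...
        df.write().mode(SaveMode.Append).parquet("parquet_data_file");
        // ...while Overwrite replaces whatever is already there.
        df.write().mode(SaveMode.Overwrite).parquet("parquet_data_file");
    }
}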
The Apache Parquet merge tool is an interactive command-line tool that merges multiple Parquet table increment files into a single table increment file containing the merged segments.
Spark's API has a SaveMode called append: https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/SaveMode.html, which I believe solves your problem.
Example of use:
df.write.mode('append').parquet('parquet_data_file')
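Note that append adds new part files under parquet_data_file, which Spark treats as a directory-based dataset; it does not modify a single existing .parquet file in place.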