
Updating values in an Apache Parquet file

I have a fairly hefty Parquet file in which I need to change the values of one column. One way to do this would be to update those values in the source text files and recreate the Parquet file, but I'm wondering whether there is a less expensive and overall easier solution.

marcin_koss asked Sep 12 '25 07:09



1 Answer

Let's start with the basics:

Parquet is a file format that needs to be saved in a file system.

Key questions:

  1. Does Parquet support append operations?
  2. Does the file system (namely, HDFS) allow append on files?
  3. Can the job framework (Spark) implement append operations?

Answers:

  1. parquet.hadoop.ParquetFileWriter only supports CREATE and OVERWRITE; there is no append mode. (I'm not sure, but this could change in other implementations; the Parquet design itself does support append.)

  2. HDFS allows appending to files when the dfs.support.append property is enabled.

  3. Spark does not support appending to existing Parquet files, and there are no plans to add it; see this JIRA

It is not a good idea to append to an existing file in distributed systems, especially given we might have two writers at the same time.

KrazyGautam answered Sep 14 '25 21:09
