
Append a new column to an existing parquet file

Is there any way to append a new column to an existing parquet file?

I'm currently working on a kaggle competition, and I've converted all the data to parquet files.

Here is my case: I read the parquet file into a PySpark DataFrame, did some feature extraction, and appended new columns to the DataFrame with

pyspark.DataFrame.withColumn().

After that, I want to save the new columns in the source parquet file.

I know Spark SQL comes with Parquet schema evolution, but the example only shows the case with a key-value pair.

The parquet "append" mode doesn't do the trick either. It only append new rows to the parquet file. If there's anyway to append a new column to an existing parquet file instead of generate the whole table again? Or I have to generate a separate new parquet file and join them on the runtime.

Chu-Yu Hsu asked Aug 04 '15

People also ask

How do I add a column in Parquet?

You do not modify columns in place. You read, change, then re-write. One way to do this in Hive query language: select the parquet data into a non-parquet table, do your work to modify the new table (update the new column, etc.), then select back into a new parquet table with the new schema.
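A rough PySpark sketch of that read-modify-rewrite flow (the paths and the derived column are placeholders, not from the question):

from pyspark.sql import functions as F

df = spark.read.parquet('/data/events')                          # read the existing parquet data
df_new = df.withColumn('event_year', F.year(df['event_date']))   # "do your work": add or modify columns

# re-write to a NEW location with the new schema; overwriting the path you are still reading from is unsafe
df_new.write.mode('overwrite').parquet('/data/events_v2')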

Can you append to a Parquet file?

Parquet slices columns into chunks and allows parts of a column to be stored in several chunks within a single file, thus append is possible.
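In PySpark that kind of append looks roughly like this (the path is a placeholder); note that it adds rows, not columns:

new_rows = spark.createDataFrame([('4', 'Name_4', 'Address_4')], ['ID', 'Name', 'Address'])
new_rows.write.mode('append').parquet('/data/people')   # appends new row groups/files with the same schema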

How do I add a column to an existing Spark DataFrame?

In PySpark, to add a new column to a DataFrame use the lit() function, imported with from pyspark.sql.functions import lit. lit() takes the constant value you want to add and returns a Column type; if you want to add a NULL / None, use lit(None).
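For example (the DataFrame and the added columns here are just illustrative):

from pyspark.sql.functions import lit

df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'val'])
df.withColumn('source', lit('kaggle')).show()              # constant column
df.withColumn('extra', lit(None).cast('string')).show()    # explicit NULL column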

Can you update a Parquet file?

TL;DR: parquet-rewriter is a way to update parquet files by rewriting them in a more efficient manner.


1 Answer

Yes, it is possible with both Databricks Delta and parquet tables. An example is given below:

This example is written in Python (PySpark):

df = sqlContext.createDataFrame([('1','Name_1','Address_1'),('2','Name_2','Address_2'),('3','Name_3','Address_3')], schema=['ID', 'Name', 'Address'])

delta_tblNm = 'testDeltaSchema.test_delta_tbl'
parquet_tblNm = 'testParquetSchema.test_parquet_tbl'

delta_write_loc = 'dbfs:///mnt/datalake/stg/delta_tblNm'
parquet_write_loc = 'dbfs:///mnt/datalake/stg/parquet_tblNm'


# DELTA TABLE
df.write.format('delta').mode('overwrite').option('overwriteSchema', 'true').save(delta_write_loc)
spark.sql(" create table if not exists {} using DELTA LOCATION '{}'".format(delta_tblNm, delta_write_loc))
spark.sql("refresh table {}".format(print(cur_tblNm)))

# PARQUET TABLE
df.write.format("parquet").mode("overwrite").save(parquet_write_loc)
spark.sql("""CREATE TABLE if not exists {} USING PARQUET LOCATION '{}'""".format(parquet_tblNm, parquet_write_loc))
spark.sql(""" REFRESH TABLE {} """.format(parquet_tblNm))

test_df = spark.sql("select * testDeltaSchema.test_delta_tbl")
test_df.show()

test_df = spark.sql("select * from testParquetSchema.test_parquet_tbl")
test_df.show()

test_df = spark.sql("ALTER TABLE  testDeltaSchema.test_delta_tbl ADD COLUMNS (Mob_number String COMMENT 'newCol' AFTER Address)")
test_df.show()

test_df = spark.sql("ALTER TABLE  testParquetSchema.test_parquet_tbl ADD COLUMNS (Mob_number String COMMENT 'newCol' AFTER Address)")
test_df.show()
Sandy answered Sep 21 '22