How can I write a DataFrame that has duplicate column names after a join operation into a CSV file? Currently I am using the following code:

dfFinal.coalesce(1).write.format('com.databricks.spark.csv').save('/home/user/output/', header='true')

which writes the DataFrame dfFinal to /home/user/output. But it does not work when the DataFrame contains a duplicate column. Below is the dfFinal DataFrame.
+----------+---+------+---+------+
|    NUMBER| ID|AMOUNT| ID|AMOUNT|
+----------+---+------+---+------+
|9090909092|  1|    30|  1|    40|
|9090909093|  2|    30|  2|    50|
|9090909090|  3|    30|  3|    60|
|9090909094|  4|    30|  4|    70|
+----------+---+------+---+------+
The above DataFrame is the result of a join operation. When writing it to a CSV file, I get the following error:
pyspark.sql.utils.AnalysisException: u'Found duplicate column(s) when inserting into file:/home/user/output: `amount`, `id`;'
In PySpark you can save (write/export) a DataFrame to a CSV file on disk using df.write.csv("path"); the same API can write to AWS S3, Azure Blob, HDFS, or any other file system PySpark supports. The write goes through the DataFrameWriter object returned by df.write: it takes the path where the file should be written and, by default, it does not write a header row with the column names. The DataFrameWriter also provides the option() method to customize reading and writing behaviour, such as the character set, header, and delimiter of the CSV file.
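For example, a minimal sketch of such a write with explicit options (the output path and delimiter here are illustrative, not taken from the question):

# Write a DataFrame as CSV with a header row and an explicit delimiter
df.write.option('header', 'true') \
    .option('delimiter', ',') \
    .mode('overwrite') \
    .csv('/tmp/output/')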
If we want to drop the duplicate column, we have to specify the join column by name in the join function: here we simply join the two DataFrames on that column and then drop the remaining duplicates, as sketched below.
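For instance, a minimal sketch of the explicit-drop variant, using the left and right DataFrames defined in the example below (the name joined is illustrative):

# Joining on a column expression keeps both ID columns;
# drop the right-hand copy explicitly
joined = left.join(right, left['ID'] == right['ID']).drop(right['ID'])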
When you specify the join column as a string (or a list of column names), the join produces only one copy of that column [1]. PySpark example:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

l = [('9090909092',1,30),('9090909093',2,30),('9090909090',3,30),('9090909094',4,30)]
r = [(1,40),(2,50),(3,60),(4,70)]
left = spark.createDataFrame(l, ['NUMBER','ID','AMOUNT'])
right = spark.createDataFrame(r, ['ID','AMOUNT'])
# Joining on the column name "ID" keeps a single ID column in the result
df = left.join(right, 'ID')
df.show()
+---+----------+------+------+
| ID|    NUMBER|AMOUNT|AMOUNT|
+---+----------+------+------+
|  1|9090909092|    30|    40|
|  3|9090909090|    30|    60|
|  2|9090909093|    30|    50|
|  4|9090909094|    30|    70|
+---+----------+------+------+
But this still produces duplicate column names for every column that is not a join column (the AMOUNT column in this example). For such columns you should assign a new name, before or after the join, with the toDF DataFrame function [2]:
# toDF renames all columns positionally, so the order must match df's schema
newNames = ['ID', 'NUMBER', 'LAMOUNT', 'RAMOUNT']
df = df.toDF(*newNames)
df.show()
+---+----------+-------+-------+
| ID|    NUMBER|LAMOUNT|RAMOUNT|
+---+----------+-------+-------+
|  1|9090909092|     30|     40|
|  3|9090909090|     30|     60|
|  2|9090909093|     30|     50|
|  4|9090909094|     30|     70|
+---+----------+-------+-------+
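With all column names now unique, the write from the question no longer raises the AnalysisException. For example, reusing the asker's own call:

# Write the renamed DataFrame as a single CSV file with a header row
df.coalesce(1).write.format('com.databricks.spark.csv').save('/home/user/output/', header='true')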
[1] https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
[2] http://spark.apache.org/docs/2.2.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.toDF