How to write a dataframe with duplicate column names into a CSV file in PySpark

How can I write a dataframe that has duplicate column names (the result of a join operation) to a CSV file? Currently I am using the following code:

dfFinal.coalesce(1).write.format('com.databricks.spark.csv').save('/home/user/output/', header='true')

which writes the dataframe "dfFinal" to "/home/user/output". But it does not work when the dataframe contains a duplicate column. Below is the dfFinal dataframe:

+----------+---+-----------------+---+-----------------+
|  NUMBER  | ID|AMOUNT           | ID|           AMOUNT|
+----------+---+-----------------+---+-----------------+
|9090909092|  1|               30|  1|               40|
|9090909093|  2|               30|  2|               50|
|9090909090|  3|               30|  3|               60|
|9090909094|  4|               30|  4|               70|
+----------+---+-----------------+---+-----------------+

The above dataframe is produced by a join operation. When writing it to a CSV file, I get the following error:

pyspark.sql.utils.AnalysisException: u'Found duplicate column(s) when inserting into file:/home/user/output: `amount`, `id`;'
Asked Oct 03 '18 by Nandu

1 Answer

When you specify the join column as a string (or a list of strings), the join produces only one copy of that column [1]. PySpark example:

# Assumes a SparkSession is available as `spark` (as in the pyspark shell).
l = [('9090909092',1,30),('9090909093',2,30),('9090909090',3,30),('9090909094',4,30)]
r = [(1,40),(2,50),(3,60),(4,70)]

left = spark.createDataFrame(l, ['NUMBER','ID','AMOUNT'])
right = spark.createDataFrame(r, ['ID','AMOUNT'])

# Joining on the column name (a string) keeps a single ID column.
df = left.join(right, "ID")
df.show()

+---+----------+------+------+
| ID|    NUMBER|AMOUNT|AMOUNT|
+---+----------+------+------+
|  1|9090909092|    30|    40|
|  3|9090909090|    30|    60|
|  2|9090909093|    30|    50|
|  4|9090909094|    30|    70|
+---+----------+------+------+
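
The same applies when you pass a list of column names, which is how you would join on several keys at once [1]. A short sketch using the same dataframes as above (df_multi is a hypothetical name):

# A list of join columns also collapses each key into a single column.
df_multi = left.join(right, ['ID'])

With only one key in the list, the result is identical to the string form.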

But this will still produce duplicate column names for all columns that aren't join columns (the AMOUNT column in this example). For these columns you should assign new names, before or after the join, with the toDF dataframe function [2]:

# Rename all columns positionally so that every name is unique.
newNames = ['ID', 'NUMBER', 'LAMOUNT', 'RAMOUNT']
df = df.toDF(*newNames)
df.show()

+---+----------+-------+-------+
| ID|    NUMBER|LAMOUNT|RAMOUNT|
+---+----------+-------+-------+
|  1|9090909092|     30|     40|
|  3|9090909090|     30|     60|
|  2|9090909093|     30|     50|
|  4|9090909094|     30|     70|
+---+----------+-------+-------+
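
With every column name unique, the write from the question now goes through. A minimal sketch reusing the asker's output path and writer format:

# The duplicate-column AnalysisException no longer occurs.
df.coalesce(1).write.format('com.databricks.spark.csv') \
    .save('/home/user/output/', header='true')

On Spark 2.x and later the built-in writer does the same without the external Databricks package: df.coalesce(1).write.csv('/home/user/output/', header=True).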

[1] https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html

[2] http://spark.apache.org/docs/2.2.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.toDF

Answered Sep 22 '22 by cronoik