How can I write a DataFrame that has duplicate column names after a join operation into a CSV file? Currently I am using the following code:

dfFinal.coalesce(1).write.format('com.databricks.spark.csv').save('/home/user/output/', header='true')

which writes the DataFrame dfFinal to /home/user/output. But it does not work when the DataFrame contains a duplicate column. Below is the dfFinal DataFrame.
+----------+---+------+---+------+
|    NUMBER| ID|AMOUNT| ID|AMOUNT|
+----------+---+------+---+------+
|9090909092|  1|    30|  1|    40|
|9090909093|  2|    30|  2|    50|
|9090909090|  3|    30|  3|    60|
|9090909094|  4|    30|  4|    70|
+----------+---+------+---+------+
The above DataFrame is the result of a join operation. When writing it to a CSV file, I get the following error:
pyspark.sql.utils.AnalysisException: u'Found duplicate column(s) when inserting into file:/home/user/output: `amount`, `id`;'
In PySpark you can save (write/export) a DataFrame to a CSV file on disk using df.write.csv("path"); the same API can write to AWS S3, Azure Blob, HDFS, or any other file system PySpark supports. The write goes through the DataFrameWriter object returned by df.write: it takes the path where the file should be written and, by default, it does not write a header row with the column names. The DataFrameWriter also provides the option() method to customize reading and writing behaviour, such as the character set, header, and delimiter of the CSV file.
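For example, a minimal sketch of such a write with explicit options (the output path and delimiter here are illustrative, not taken from the question):

# Write a DataFrame as CSV with a header row and an explicit delimiter
df.write.option('header', 'true') \
    .option('delimiter', ',') \
    .mode('overwrite') \
    .csv('/tmp/output/')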
If we want to drop the duplicate column, we have to specify the join column by name in the join function: here we simply join the two DataFrames on that column and then drop the remaining duplicates, as sketched below.
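For instance, a minimal sketch of the explicit-drop variant, using the left and right DataFrames defined in the example below (the name joined is illustrative):

# Joining on a column expression keeps both ID columns;
# drop the right-hand copy explicitly
joined = left.join(right, left['ID'] == right['ID']).drop(right['ID'])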
When you specify the join column as a string (or a list of column names), the join produces only one copy of that column [1]. PySpark example:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

l = [('9090909092',1,30),('9090909093',2,30),('9090909090',3,30),('9090909094',4,30)]
r = [(1,40),(2,50),(3,60),(4,70)]
left = spark.createDataFrame(l, ['NUMBER','ID','AMOUNT'])
right = spark.createDataFrame(r, ['ID','AMOUNT'])
# Joining on the column name "ID" keeps a single ID column in the result
df = left.join(right, 'ID')
df.show()
+---+----------+------+------+
| ID|    NUMBER|AMOUNT|AMOUNT|
+---+----------+------+------+
|  1|9090909092|    30|    40|
|  3|9090909090|    30|    60|
|  2|9090909093|    30|    50|
|  4|9090909094|    30|    70|
+---+----------+------+------+
But this still produces duplicate column names for every column that is not a join column (the AMOUNT column in this example). For such columns you should assign a new name, before or after the join, with the toDF DataFrame function [2]:
# toDF renames all columns positionally, so the order must match df's schema
newNames = ['ID', 'NUMBER', 'LAMOUNT', 'RAMOUNT']
df = df.toDF(*newNames)
df.show()
+---+----------+-------+-------+
| ID|    NUMBER|LAMOUNT|RAMOUNT|
+---+----------+-------+-------+
|  1|9090909092|     30|     40|
|  3|9090909090|     30|     60|
|  2|9090909093|     30|     50|
|  4|9090909094|     30|     70|
+---+----------+-------+-------+
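With all column names now unique, the write from the question no longer raises the AnalysisException. For example, reusing the asker's own call:

# Write the renamed DataFrame as a single CSV file with a header row
df.coalesce(1).write.format('com.databricks.spark.csv').save('/home/user/output/', header='true')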
[1] https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
[2] http://spark.apache.org/docs/2.2.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.toDF