 

How to write csv file into one file by pyspark

Tags:

pyspark

I use this method to write a CSV file, but it generates a directory containing multiple part files. That is not what I want; I need a single file. I also found another post that uses Scala to force everything to be computed on one partition and so produce one file.

First question: how can I achieve this in Python?

The second post also mentions that a Hadoop function can merge multiple files into one.

Second question: is it possible to merge two files in Spark?

asked Apr 12 '16 by sydridgm

2 Answers

You can use:

df.coalesce(1).write.csv('result.csv')

Note: when you use the coalesce function you will lose your parallelism. Also, Spark writes 'result.csv' as a directory containing a single part file, not as a plain file.
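If you need an actual single file at a fixed path (rather than a directory holding one part file), a small post-processing step on the driver can move the part file out. This is a plain-Python sketch; the helper name promote_single_part is made up for illustration and is not part of any Spark API:

```python
import glob
import os
import shutil

def promote_single_part(spark_output_dir, target_path):
    # After df.coalesce(1).write.csv(spark_output_dir), the directory
    # should contain exactly one part-*.csv file.
    parts = glob.glob(os.path.join(spark_output_dir, "part-*.csv"))
    if len(parts) != 1:
        raise RuntimeError("expected exactly one part file, found %d" % len(parts))
    # Move it to the desired path, then drop the Spark output directory
    # (which still holds _SUCCESS and checksum files).
    shutil.move(parts[0], target_path)
    shutil.rmtree(spark_output_dir)
```

This only works when the output directory is on a filesystem the driver can see locally; for HDFS or S3 you would use the corresponding filesystem API instead.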

answered Sep 17 '22 by Mohamed Thasin ah


The requirement is to save an RDD in a single CSV file by bringing the RDD to one executor, which means RDD partitions spread across executors get shuffled onto a single executor. We can use coalesce(1) or repartition(1) for this purpose. In addition, we can add a column header to the resulting CSV file. First, keep a utility function to make each row CSV-compatible:

def toCSVLine(data):
    return ','.join(str(d) for d in data)
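One caveat with the join-based helper: it does not quote fields, so a value that itself contains a comma will corrupt the row. If that matters, the standard csv module handles the escaping; the variant below (the name to_csv_line is just an alternative name for illustration) can be used as a drop-in replacement:

```python
import csv
import io

def to_csv_line(data):
    # Unlike a plain ','.join, csv.writer quotes fields that
    # contain commas, quotes, or newlines.
    buf = io.StringIO()
    csv.writer(buf).writerow(data)
    return buf.getvalue().rstrip("\r\n")
```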

Suppose MyRDD has five columns and needs 'ID', 'DT_KEY', 'Grade', 'Score', 'TRF_Age' as column headers. We create a header RDD and union it with MyRDD as below, which in most cases keeps the header at the top of the CSV file (though the ordering is not strictly guaranteed):

unionHeaderRDD = sc.parallelize([('ID', 'DT_KEY', 'Grade', 'Score', 'TRF_Age')]) \
    .union(MyRDD)

unionHeaderRDD.coalesce(1).map(toCSVLine).saveAsTextFile("MyFileLocation")
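Without a cluster at hand, the effect of the union-plus-map pipeline above can be sketched in plain Python; the function below mirrors what the two RDD lines produce, assuming the header row lands first as it usually does:

```python
def csv_with_header(header, rows):
    # Local mirror of:
    # sc.parallelize([header]).union(rows).map(toCSVLine)
    def to_line(data):
        return ','.join(str(d) for d in data)
    # Prepend the header row, then format every row as a CSV line.
    return [to_line(r) for r in [tuple(header)] + list(rows)]
```

On the DataFrame API, the simpler route is df.coalesce(1).write.csv(path, header=True), which writes the header deterministically without the union trick.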

The RDD's saveAsPickleFile method can be used to serialize the data being saved in order to save space; use the SparkContext's pickleFile method to read the pickled file back.

answered Sep 21 '22 by AK Gangoni