
pyspark - multiple input files into one RDD and one output file

I have a word count in Python that I want to run on Spark with multiple text files, and I want ONE output file, so that the words are counted across all files together. I tried a few solutions, for example the ones found here and here, but it still gives the same number of output files as the number of input files.

rdd = sc.textFile("file:///path/*.txt")
input = sc.textFile(join(rdd))

or

rdd = sc.textFile("file:///path/f0.txt,file:///path/f1.txt,...")
rdds = Seq(rdd)
input = sc.textFile(','.join(rdds))

or

rdd = sc.textFile("file:///path/*.txt")
input = sc.union(rdd)

None of these work. Can anybody suggest how to make one RDD out of several input text files?

Thanks in advance...

asked Jan 07 '23 by piterd


1 Answer

This should load all the files matching the pattern.

rdd = sc.textFile("file:///path/*.txt")

Now, you do not need to do any union. You have only one RDD.
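If you prefer to list the files explicitly instead of using a glob, textFile also accepts a comma-separated string of paths, and sc.union combines already-built RDDs into one. A minimal sketch, assuming the hypothetical paths /path/f0.txt and /path/f1.txt:

# A comma-separated list of paths also yields a single RDD over all files
rdd = sc.textFile("file:///path/f0.txt,file:///path/f1.txt")

# Or build the RDDs separately and union them; sc.union takes a list of RDDs
rdd = sc.union([sc.textFile("file:///path/f0.txt"),
                sc.textFile("file:///path/f1.txt")])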

Coming to your question of why you are getting many output files: the number of output files depends on the number of partitions in the RDD. When you run the word-count logic, the resulting RDD can have more than one partition. If you want to save the RDD as a single file, use coalesce or repartition so that it has only one partition.
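As a quick illustration (using the counts RDD produced by the word-count code shown below), you can inspect the partition count and collapse it to one before saving:

# Number of partitions = number of part files saveAsTextFile would write
print(counts.getNumPartitions())

# coalesce(1) merges partitions without a full shuffle
one_part = counts.coalesce(1)
# one_part = counts.repartition(1)  # same effect, but forces a full shuffle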

The code below works; it is adapted from the Spark examples.

# Load all matching files into a single RDD
rdd = sc.textFile("file:///path/*.txt")

# Classic word count: split lines into words, pair each word with 1, sum the counts
counts = rdd.flatMap(lambda line: line.split(" ")) \
            .map(lambda word: (word, 1)) \
            .reduceByKey(lambda a, b: a + b)

# coalesce(1) collapses the result to one partition, so only one part file is written
counts.coalesce(1).saveAsTextFile("res.csv")
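Note that saveAsTextFile writes a directory named res.csv containing the output as a part-00000 file (plus a _SUCCESS marker); if you need a single literal file, move or rename that part file after the job finishes.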
answered Jan 10 '23 by Mohitt