I have a word count program in Python that I want to run on Spark over multiple text files, and I want ONE output file so that the words are counted across all the files together. I tried a few solutions, for example the ones found here and here, but it still produces the same number of output files as input files.
rdd = sc.textFile("file:///path/*.txt")
input = sc.textFile(join(rdd))
or
rdd = sc.textFile("file:///path/f0.txt,file:///path/f1.txt,...")
rdds = Seq(rdd)
input = sc.textFile(','.join(rdds))
or
rdd = sc.textFile("file:///path/*.txt")
input = sc.union(rdd)
don't work. Can anybody suggest how to make one RDD out of several input text files?
Thanks in advance...
This should load all the files matching the pattern.
rdd = sc.textFile("file:///path/*.txt")
Now, you do not need to do any union. You have only one RDD.
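For completeness, if you really do load the files as separate RDDs, sc.union takes a list of RDDs (not a string of paths), which is where your second and third attempts went wrong. A minimal sketch, assuming sc is an existing SparkContext and the paths are placeholders:

# Load two files separately (placeholder paths).
rdd0 = sc.textFile("file:///path/f0.txt")
rdd1 = sc.textFile("file:///path/f1.txt")
# sc.union expects a list of RDDs and returns a single combined RDD.
combined = sc.union([rdd0, rdd1])

But with the glob pattern above, none of this is necessary.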
Coming to your question: why are you getting many output files? The number of output files depends on the number of partitions in the RDD. When you run the word count logic, the resulting RDD can have more than one partition. If you want to save the RDD as a single file, use coalesce or repartition to reduce it to one partition.
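As a quick illustration (again assuming sc is an existing SparkContext and the path is a placeholder), you can inspect the partition count and collapse it before saving:

rdd = sc.textFile("file:///path/*.txt")
print(rdd.getNumPartitions())      # often one or more partitions per input file
single = rdd.coalesce(1)           # merges partitions without a full shuffle
# single = rdd.repartition(1)      # alternative; forces a shuffle
print(single.getNumPartitions())   # 1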
The code below works; it is taken from the Spark examples.
rdd = sc.textFile("file:///path/*.txt")
counts = rdd.flatMap(lambda line: line.split(" ")) \
... .map(lambda word: (word, 1)) \
... .reduceByKey(lambda a, b: a + b)
counts.coalesce(1).saveAsTextFile("res.csv")
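Note that saveAsTextFile writes a directory named res.csv containing a part file (e.g. part-00000), not a single flat CSV file; coalescing to one partition just guarantees there is only one part file inside that directory.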