I have a word count program in Python that I want to run on Spark over multiple text files, and I want ONE output file so that the words are counted across all the files together. I tried a few solutions, for example the ones found here and here, but it still produces the same number of output files as input files.
rdd = sc.textFile("file:///path/*.txt")
input = sc.textFile(join(rdd))
or
rdd = sc.textFile("file:///path/f0.txt,file:///path/f1.txt,...")
rdds = Seq(rdd)
input = sc.textFile(','.join(rdds))
or
rdd = sc.textFile("file:///path/*.txt")
input = sc.union(rdd)
don't work. Can anybody suggest how to make one RDD out of several input text files?
Thanks in advance...
This should load all the files matching the pattern.
rdd = sc.textFile("file:///path/*.txt")
Now, you do not need to do any union. You have only one RDD.
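For completeness, if you really do load the files as separate RDDs, sc.union takes a list of RDDs (not a string of paths), which is where your second and third attempts went wrong. A minimal sketch, assuming sc is an existing SparkContext and the paths are placeholders:

# Load two files separately (placeholder paths).
rdd0 = sc.textFile("file:///path/f0.txt")
rdd1 = sc.textFile("file:///path/f1.txt")
# sc.union expects a list of RDDs and returns a single combined RDD.
combined = sc.union([rdd0, rdd1])

But with the glob pattern above, none of this is necessary.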
Coming to your question: why are you getting many output files? The number of output files depends on the number of partitions in the RDD. When you run the word count logic, the resulting RDD can have more than one partition. If you want to save the RDD as a single file, use coalesce or repartition to reduce it to one partition.
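As a quick illustration (again assuming sc is an existing SparkContext and the path is a placeholder), you can inspect the partition count and collapse it before saving:

rdd = sc.textFile("file:///path/*.txt")
print(rdd.getNumPartitions())      # often one or more partitions per input file
single = rdd.coalesce(1)           # merges partitions without a full shuffle
# single = rdd.repartition(1)      # alternative; forces a shuffle
print(single.getNumPartitions())   # 1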
The code below works; it is taken from the Spark examples.
rdd = sc.textFile("file:///path/*.txt")
counts = rdd.flatMap(lambda line: line.split(" ")) \
... .map(lambda word: (word, 1)) \
... .reduceByKey(lambda a, b: a + b)
counts.coalesce(1).saveAsTextFile("res.csv")
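Note that saveAsTextFile writes a directory named res.csv containing a part file (e.g. part-00000), not a single flat CSV file; coalescing to one partition just guarantees there is only one part file inside that directory.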