I need to process multiple files scattered across various directories. I would like to load them all into a single RDD and then perform map/reduce on it. I see that SparkContext can load multiple files from a single directory using wildcards, but I am not sure how to load files from multiple folders.
The following code snippet fails:
for fileEntry in files:
    fileName = basePath + "/" + fileEntry
    lines = sc.textFile(fileName)
    if retval == None:
        retval = lines
    else:
        retval = sc.union(retval, lines)
This fails on the third loop with the following error message:
retval = sc.union(retval, lines)
TypeError: union() takes exactly 2 arguments (3 given)
Which is bizarre, given that I am providing only 2 arguments. Any pointers appreciated.
Spark core provides the textFile() and wholeTextFiles() methods on SparkContext for reading one or more text or CSV files into a single RDD. Both accept a directory path or a glob pattern, so you can read every file in a directory or only the files matching a pattern.
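For illustration, here is a minimal PySpark sketch of those options (the data/ directory and file names are hypothetical):

from pyspark import SparkContext

sc = SparkContext("local[*]", "multi-file-read")

# A single file becomes an RDD of its lines.
one = sc.textFile("data/part-0001.txt")

# A directory or a glob pattern reads many files into one RDD of lines.
many = sc.textFile("data/*.txt")

# wholeTextFiles() yields (path, content) pairs instead of lines,
# useful when you need to know which file each record came from.
pairs = sc.wholeTextFiles("data/")

print(many.count(), pairs.keys().collect())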
Assuming you are using Scala, create a parallel collection of your files using the HDFS client and the .par convenience method, then map the result onto spark.read and call an action. Voilà: if you have enough resources in the cluster, all the files will be read in parallel.
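That recipe is Scala-specific (.par turns a driver-side collection into a parallel one). A rough Python analogue, assuming a SparkSession named spark and using a driver-side thread pool so the reads are issued concurrently, might look like:

from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.appName("parallel-read").getOrCreate()
paths = ["dir1/a.txt", "dir2/b.txt"]  # hypothetical file locations

# spark.read.text() is lazy, but issuing the calls from a thread pool
# lets the driver-side file listing and setup run concurrently.
with ThreadPoolExecutor() as pool:
    dfs = list(pool.map(lambda p: spark.read.text(p), paths))

# Combine the per-file DataFrames into one.
combined = reduce(DataFrame.union, dfs)
combined.show()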
How about this phrasing instead?
sc.union([sc.textFile(basepath + "/" + f) for f in files])
In Scala, SparkContext.union() has two overloads: one that takes varargs and one that takes a list. Only the second exists in Python, since Python does not support method overloading. The error message even reflects this: the argument count includes self, so union(self, rdds) takes exactly 2 arguments, and sc.union(retval, lines) supplies 3.
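So a minimal fix for the original loop, reusing the question's sc, basePath, and files, is to collect the per-file RDDs and pass them to union() as one list:

# Build one RDD per file, then hand the whole list to union() at once.
rdds = []
for fileEntry in files:
    rdds.append(sc.textFile(basePath + "/" + fileEntry))
retval = sc.union(rdds)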
UPDATE
You can use a single textFile call to read multiple files.
sc.textFile(','.join(files))
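Note that textFile() expects full paths in that comma-separated string; with the question's basePath and files, that would be something like:

# Join full paths with commas; textFile() accepts a comma-separated list.
paths = [basePath + "/" + f for f in files]
retval = sc.textFile(",".join(paths))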