I need to process multiple files scattered across various directories. I would like to load all these up in a single RDD and then perform map/reduce on it. I see that SparkContext is able to load multiple files from a single directory using wildcards. I am not sure how to load up files from multiple folders.
The following code snippet fails:
    for fileEntry in files:
        fileName = basePath + "/" + fileEntry
        lines = sc.textFile(fileName)
        if retval == None:
            retval = lines
        else:
            retval = sc.union(retval, lines)

This fails on the third loop with the following error message:

    retval = sc.union(retval, lines)
    TypeError: union() takes exactly 2 arguments (3 given)

Which is bizarre given I am providing only 2 arguments. Any pointers appreciated.
Spark – Read multiple text files into single RDD? Spark core provides the textFile() and wholeTextFiles() methods in the SparkContext class, which read one or more text or CSV files into a single Spark RDD. These methods can also read all files in a directory, or only the files matching a specific pattern.
Assuming you are using Scala, create a parallel collection of your files using the HDFS client and the .par convenience method, then map the result onto spark.read and call an action. Voilà: if you have enough resources in the cluster, all files will be read in parallel.
How about this phrasing instead?

    sc.union([sc.textFile(basePath + "/" + f) for f in files])

In Scala, SparkContext.union() has two variants: one that takes vararg arguments and one that takes a list. Only the second one exists in Python, since Python does not support method overloading. The error message counts the implicit self argument, which is why passing two RDDs is reported as three arguments.
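To see where the "3 given" comes from without a Spark cluster, here is a minimal sketch in plain Python. FakeContext is a hypothetical stand-in written for illustration; it is not PySpark's SparkContext and only mimics the shape of a union method that accepts a single list.

```python
class FakeContext:
    """Hypothetical stand-in for illustration; not PySpark's SparkContext."""

    def union(self, rdds):
        # One parameter holding a list of inputs, mirroring the Python-side shape.
        combined = []
        for rdd in rdds:
            combined.extend(rdd)
        return combined


ctx = FakeContext()

# Passing two separate arguments fails: Python counts the implicit `self`
# as an argument too, so the call is reported as having three.
error_message = None
try:
    ctx.union(["a"], ["b"])
except TypeError as exc:
    error_message = str(exc)

# Passing a single list of inputs works, however many inputs it holds.
merged = ctx.union([["a"], ["b"], ["c"]])
```

The exact wording of the TypeError differs between Python versions, but the argument count it reports always includes self.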
UPDATE
You can use a single textFile call to read multiple files.
    sc.textFile(','.join(files))
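A minimal sketch of the single-call approach when the entries in files are relative to a base path, as in the question. The helper names build_path_spec and load_as_single_rdd are made up for this example, and sc is assumed to be an existing SparkContext. File names that themselves contain commas would not work with this approach.

```python
def build_path_spec(base_path, files):
    """Return one comma-separated string of full paths under base_path."""
    root = base_path.rstrip("/")
    return ",".join(root + "/" + name for name in files)


def load_as_single_rdd(sc, base_path, files):
    """Read every listed file with a single textFile call.

    `sc` is assumed to be an already created SparkContext.
    """
    return sc.textFile(build_path_spec(base_path, files))
```

With a live context this would be called as load_as_single_rdd(sc, basePath, files), and the result can be used like any other RDD of lines.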