
Spark Context Textfile: load multiple files

I need to process multiple files scattered across various directories. I would like to load all these up in a single RDD and then perform map/reduce on it. I see that SparkContext is able to load multiple files from a single directory using wildcards. I am not sure how to load up files from multiple folders.

The following code snippet fails:

for fileEntry in files:
    fileName = basePath + "/" + fileEntry
    lines = sc.textFile(fileName)
    if retval == None:
        retval = lines
    else:
        retval = sc.union(retval, lines)

This fails on the third loop with the following error message:

    retval = sc.union(retval, lines)
TypeError: union() takes exactly 2 arguments (3 given)

Which is bizarre given I am providing only 2 arguments. Any pointers appreciated.

asked Apr 30 '14 21:04 by Raj



1 Answer

How about this phrasing instead?

sc.union([sc.textFile(basePath + "/" + f) for f in files])

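In Scala, SparkContext.union() has two variants: one that takes varargs and one that takes a list. Only the second exists in Python, since Python does not support method overloading. The implicit self counts as an argument, which is why the error reports 3 given for a call that passes 2.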
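To see where the "3 given" comes from without a cluster, here is a minimal sketch. FakeContext is a hypothetical stand-in class, not part of PySpark; it only mimics the single-list signature of union described above, and plain Python lists stand in for RDDs.

```python
class FakeContext:
    """Hypothetical stand-in for SparkContext; not part of PySpark."""

    def union(self, rdds):
        # Mirrors the single-list signature: merge every dataset in `rdds`.
        merged = []
        for rdd in rdds:
            merged.extend(rdd)
        return merged


sc_like = FakeContext()
a, b = ["line 1"], ["line 2"]

# Passing the datasets separately sends (self, a, b): three arguments.
try:
    sc_like.union(a, b)
    separate_call_failed = False
except TypeError:
    separate_call_failed = True

# Passing one list sends (self, [a, b]): two arguments, as declared.
merged = sc_like.union([a, b])
```

The exact wording of the TypeError differs between Python versions, but in each case the count includes self.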

UPDATE

You can also read multiple files with a single textFile call, since it accepts a comma-separated list of paths (here files must hold full paths):

sc.textFile(','.join(files)) 
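If the entries in files are relative to basePath, as in the question, they need the prefix before joining. A rough sketch, assuming sc is an existing SparkContext; build_path_argument and load_all are hypothetical helper names, and the sample paths are made up for illustration:

```python
def build_path_argument(base_path, files):
    """Prefix each file name with base_path and join with commas."""
    prefix = base_path.rstrip("/")
    return ",".join(prefix + "/" + name for name in files)


def load_all(sc, base_path, files):
    """Read every listed file into one RDD with a single textFile call.

    `sc` is assumed to be an existing pyspark SparkContext.
    """
    return sc.textFile(build_path_argument(base_path, files))


# Illustrative paths only; the files may sit in different directories.
paths = build_path_argument("/data/logs", ["2014/a.txt", "2013/b.txt"])
```

This assumes no file name contains a comma, since the comma is the separator.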
answered Sep 23 '22 19:09 by Daniel Darabos