I need to process multiple files scattered across various directories. I would like to load them all into a single RDD and then perform map/reduce on it. I see that SparkContext can load multiple files from a single directory using wildcards, but I am not sure how to load files from multiple folders.
The following code snippet fails:
for fileEntry in files:
    fileName = basePath + "/" + fileEntry
    lines = sc.textFile(fileName)
    if retval == None:
        retval = lines
    else:
        retval = sc.union(retval, lines)
This fails on the third loop with the following error message:
retval = sc.union(retval, lines)
TypeError: union() takes exactly 2 arguments (3 given)
Which is bizarre, given that I am providing only 2 arguments. Any pointers appreciated.
Spark core provides the textFile() and wholeTextFiles() methods on SparkContext for reading one or more text or CSV files into a single RDD. Both accept a directory path or a glob pattern, so you can read every file in a directory or only the files matching a pattern.
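For illustration, here is a minimal PySpark sketch of those options (the data/ directory and file names are hypothetical):

from pyspark import SparkContext

sc = SparkContext("local[*]", "multi-file-read")

# A single file becomes an RDD of its lines.
one = sc.textFile("data/part-0001.txt")

# A directory or a glob pattern reads many files into one RDD of lines.
many = sc.textFile("data/*.txt")

# wholeTextFiles() yields (path, content) pairs instead of lines,
# useful when you need to know which file each record came from.
pairs = sc.wholeTextFiles("data/")

print(many.count(), pairs.keys().collect())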
Assuming you are using Scala, create a parallel collection of your files using the HDFS client and the .par convenience method, then map the result onto spark.read and call an action. Voilà: if you have enough resources in the cluster, all the files will be read in parallel.
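That recipe is Scala-specific (.par turns a driver-side collection into a parallel one). A rough Python analogue, assuming a SparkSession named spark and using a driver-side thread pool so the reads are issued concurrently, might look like:

from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.appName("parallel-read").getOrCreate()
paths = ["dir1/a.txt", "dir2/b.txt"]  # hypothetical file locations

# spark.read.text() is lazy, but issuing the calls from a thread pool
# lets the driver-side file listing and setup run concurrently.
with ThreadPoolExecutor() as pool:
    dfs = list(pool.map(lambda p: spark.read.text(p), paths))

# Combine the per-file DataFrames into one.
combined = reduce(DataFrame.union, dfs)
combined.show()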
How about this phrasing instead?
sc.union([sc.textFile(basepath + "/" + f) for f in files])
In Scala, SparkContext.union() has two overloads: one that takes varargs and one that takes a list. Only the second exists in Python, since Python does not support method overloading. The error message even reflects this: the argument count includes self, so union(self, rdds) takes exactly 2 arguments, and sc.union(retval, lines) supplies 3.
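So a minimal fix for the original loop, reusing the question's sc, basePath, and files, is to collect the per-file RDDs and pass them to union() as one list:

# Build one RDD per file, then hand the whole list to union() at once.
rdds = []
for fileEntry in files:
    rdds.append(sc.textFile(basePath + "/" + fileEntry))
retval = sc.union(rdds)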
UPDATE
You can use a single textFile call to read multiple files.
sc.textFile(','.join(files))
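Note that textFile() expects full paths in that comma-separated string; with the question's basePath and files, that would be something like:

# Join full paths with commas; textFile() accepts a comma-separated list.
paths = [basePath + "/" + f for f in files]
retval = sc.textFile(",".join(paths))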