I have the following code:
file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
I copied the example from http://spark.apache.org/examples.html, but I am unable to understand this code, especially the keywords. Can someone please explain in plain English what's going on?
map is the easiest: it essentially says do the given operation on every element of the sequence and return the resulting sequence (very similar to foreach). flatMap is the same thing, but instead of returning exactly one element per input element you are allowed to return a sequence (which can be empty). Here's an answer explaining the difference between map and flatMap. Lastly, reduceByKey takes an aggregate function (meaning it takes two arguments of the same type and returns that type; it should also be commutative and associative, otherwise you will get inconsistent results), which is used to aggregate every V for each K in your sequence of (K, V) pairs.
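To see the difference concretely, here is a small sketch you could run in the pyspark shell (assuming sc is your SparkContext, which the shell creates for you; the tiny input is made up for illustration):
lines = sc.parallelize(["hello world", "hello"])    # made-up RDD of two lines

lines.map(lambda line: line.split(" ")).collect()
# -> [['hello', 'world'], ['hello']]   one list per line (one output element per input element)

lines.flatMap(lambda line: line.split(" ")).collect()
# -> ['hello', 'world', 'hello']       flattened into individual words

pairs = sc.parallelize([("hello", 1), ("world", 1), ("hello", 1)])
pairs.reduceByKey(lambda a, b: a + b).collect()
# -> [('hello', 2), ('world', 1)]      one reduced value per unique key (order may vary)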
Example: reduce(lambda a, b: a + b, [1, 2, 3, 4])
This says aggregate the whole list with +, so it will do:
1 + 2 = 3
3 + 3 = 6
6 + 4 = 10
final result is 10
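(In Python 3, reduce lives in the functools module, so a runnable version of that example is:)
from functools import reduce
reduce(lambda a, b: a + b, [1, 2, 3, 4])   # -> 10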
Reduce by key is the same thing except you do a reduce for each unique key.
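If it helps, here is a plain-Python sketch of that idea (just a mental model; Spark does this in a distributed way rather than with groupby):
from functools import reduce
from itertools import groupby

pairs = [("a", 1), ("b", 1), ("a", 1)]                   # made-up (K, V) pairs
{key: reduce(lambda a, b: a + b, (v for _, v in group))  # reduce the values of each key group
 for key, group in groupby(sorted(pairs), key=lambda kv: kv[0])}
# -> {'a': 2, 'b': 1}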
So, to explain your example:
file = spark.textFile("hdfs://...")          # open the text file; each element of the RDD is one line of the file
counts = (file
    .flatMap(lambda line: line.split(" "))   # split each line on spaces and emit every word as its own element
    .map(lambda word: (word, 1))             # turn each word into the pair (word, 1) so the 1's can be summed
    .reduceByKey(lambda a, b: a + b))        # add up all the 1's for each unique word, giving an RDD of (word, count)
counts.saveAsTextFile("hdfs://...")          # save the result onto HDFS
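If you want to try the whole pipeline without an HDFS cluster, here is a minimal self-contained sketch (the input line and app name are made up; it runs Spark in local mode):
from pyspark import SparkContext

sc = SparkContext("local", "wordcount-sketch")   # local mode, no cluster needed
lines = sc.parallelize(["to be or not to be"])   # made-up input instead of hdfs://...
counts = (lines
    .flatMap(lambda line: line.split(" "))
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b))
print(counts.collect())                          # e.g. [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
sc.stop()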
So, why count words this way? The reason is that the MapReduce paradigm of programming is highly parallelizable, and thus scales to doing this computation on terabytes or even petabytes of data.
I don't use Python much, so tell me if I made a mistake.