I have the following code:
file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
I copied the example from http://spark.apache.org/examples.html, but I am unable to understand this code, especially the keywords. Can someone please explain in plain English what's going on?
map is the easiest: it essentially says do the given operation on every element of the sequence and return the resulting sequence (very similar to foreach). flatMap is the same thing, but instead of returning exactly one element per input element you are allowed to return a sequence (which can be empty). Here's an answer explaining the difference between map and flatMap. Lastly, reduceByKey takes an aggregate function (meaning it takes two arguments of the same type and returns that type; it should also be commutative and associative, otherwise you will get inconsistent results), which is used to aggregate every V for each K in your sequence of (K, V) pairs.
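To see the difference concretely, here is a small sketch you could run in the pyspark shell (assuming sc is your SparkContext, which the shell creates for you; the tiny input is made up for illustration):
lines = sc.parallelize(["hello world", "hello"])    # made-up RDD of two lines

lines.map(lambda line: line.split(" ")).collect()
# -> [['hello', 'world'], ['hello']]   one list per line (one output element per input element)

lines.flatMap(lambda line: line.split(" ")).collect()
# -> ['hello', 'world', 'hello']       flattened into individual words

pairs = sc.parallelize([("hello", 1), ("world", 1), ("hello", 1)])
pairs.reduceByKey(lambda a, b: a + b).collect()
# -> [('hello', 2), ('world', 1)]      one reduced value per unique key (order may vary)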
Example: reduce(lambda a, b: a + b, [1, 2, 3, 4])
This says aggregate the whole list with +, so it will do:
1 + 2 = 3
3 + 3 = 6
6 + 4 = 10
final result is 10
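(In Python 3, reduce lives in the functools module, so a runnable version of that example is:)
from functools import reduce
reduce(lambda a, b: a + b, [1, 2, 3, 4])   # -> 10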
Reduce by key is the same thing except you do a reduce for each unique key.
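If it helps, here is a plain-Python sketch of that idea (just a mental model; Spark does this in a distributed way rather than with groupby):
from functools import reduce
from itertools import groupby

pairs = [("a", 1), ("b", 1), ("a", 1)]                   # made-up (K, V) pairs
{key: reduce(lambda a, b: a + b, (v for _, v in group))  # reduce the values of each key group
 for key, group in groupby(sorted(pairs), key=lambda kv: kv[0])}
# -> {'a': 2, 'b': 1}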
So, to explain your example:
file = spark.textFile("hdfs://...")          # open the text file; each element of the RDD is one line of the file
counts = (file
    .flatMap(lambda line: line.split(" "))   # split each line on spaces and emit every word as its own element
    .map(lambda word: (word, 1))             # turn each word into the pair (word, 1) so the 1's can be summed
    .reduceByKey(lambda a, b: a + b))        # add up all the 1's for each unique word, giving an RDD of (word, count)
counts.saveAsTextFile("hdfs://...")          # save the result onto HDFS
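If you want to try the whole pipeline without an HDFS cluster, here is a minimal self-contained sketch (the input line and app name are made up; it runs Spark in local mode):
from pyspark import SparkContext

sc = SparkContext("local", "wordcount-sketch")   # local mode, no cluster needed
lines = sc.parallelize(["to be or not to be"])   # made-up input instead of hdfs://...
counts = (lines
    .flatMap(lambda line: line.split(" "))
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b))
print(counts.collect())                          # e.g. [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
sc.stop()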
So, why count words this way? The reason is that the MapReduce paradigm of programming is highly parallelizable, and thus scales to doing this computation on terabytes or even petabytes of data.
I don't use Python much, so tell me if I made a mistake.