Understanding lambda function inputs in Spark for RDDs

Question

I am doing a course on Spark and I am a bit confused.

So there is the below code. I understand that line 1 is creating tuples (word, 1). Then line 2 is grouping by word and summing the count.

What I don't understand is what x and y are in line 2. We only have one numeric input to the lambda function, which is the count column (all the 1's) from wordCounts, so why y?

wordCounts = words.map(lambda x: (x, 1)) #outputs [('self', 1), ('employment', 1), ('building', 1)...
wordCounts2 = wordCounts.reduceByKey(lambda x, y: x + y) # outputs [('self', 111), ('an', 178), ('internet', 26)

Then we have this piece of code, which comes directly after. I understand that it sorts the RDD. To confirm my understanding, is x[1] the word and x[2] the total count? I would guess so, but I am not 100% sure.

Sorry for the stupid questions but I couldn't find a clear explanation!

wordCountsSorted = wordCounts2.map(lambda x: (x[1], x[0])).sortByKey()

pissall · Accepted Answer

1. Why x and y?

With `words.map(lambda x: (x, 1))` you make a key-value pair like (word, 1) for every word.

The key is the word and the value is 1.

When you call reduceByKey, it combines all values that share the same key.

reduceByKey(lambda x, y: x + y) groups the RDD elements by key, which is the first element of each tuple (the word), and sums up the values. The lambda never receives the key or the whole tuple; it only receives two values for the same key at a time. x is the value accumulated so far for that word and y is the next value for the same word, so the 1's are added pair by pair until a single total per word remains. The result might look something like:

# [('This', 1), ('is', 2), ('a', 3), ('random', 1), ('sample.', 2), ('And', 2), ('world', 1), ('count', 2), ('word', 1), ('sample,', 1), ('that', 1), ('it', 1)]
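To make the roles of x and y concrete, here is a minimal plain-Python sketch of the idea. It is not Spark code: `reduce_by_key_local` and `add` are hypothetical helpers written only for this illustration, and they process values strictly left to right, which Spark does not promise.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key_local(pairs, func):
    """Hypothetical local stand-in for RDD.reduceByKey (illustration only)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # func is called with two VALUES of the same key at a time:
    # x is the result so far, y is the next value for that key.
    return {key: reduce(func, values) for key, values in grouped.items()}


calls = []  # records every (x, y) pair the function receives


def add(x, y):
    calls.append((x, y))
    return x + y


pairs = [("a", 1), ("is", 1), ("a", 1), ("a", 1), ("is", 1), ("this", 1)]
counts = reduce_by_key_local(pairs, add)

print(counts)  # {'a': 3, 'is': 2, 'this': 1}
print(calls)   # [(1, 1), (2, 1), (1, 1)]
```

For the key 'a' the function is called twice, first with (1, 1) and then with (2, 1), so x carries the running total. For 'this' it is never called because there is only one value. In real Spark the values are combined within each partition and then across partitions, in no fixed order, which is why the function passed to reduceByKey should be associative and commutative, as addition is.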

2. Let's break down your next question about `wordCountsSorted = wordCounts2.map(lambda x: (x[1], x[0])).sortByKey()`

The following line swaps the elements of each tuple: the element at position 0 moves to position 1, and the element at position 1 moves to position 0. Python indexes from 0, so in wordCounts2 x[0] is the word and x[1] is the total count; there is no x[2].

reversed_tup = wordCounts2.map(lambda x: (x[1], x[0]))

The output will look like:

# [(1, 'This'), (2, 'is'), (3, 'a'), (1, 'random'), (2, 'sample.'), (2, 'And'), (1, 'world'), (2, 'count'), (1, 'word'), (1, 'sample,'), (1, 'that'), (1, 'it')]
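The indexing can be checked on a single tuple in plain Python, without Spark. The names `pair`, `swap` and `swapped` below are made up for this sketch; the lambda body is the same one passed to map above.

```python
pair = ("a", 3)  # shape of one element of wordCounts2: (word, count)

word = pair[0]   # index 0 is the word
count = pair[1]  # index 1 is the count

swap = lambda x: (x[1], x[0])  # same lambda body as in the map call
swapped = swap(pair)

print(word, count)  # a 3
print(swapped)      # (3, 'a')
```

Asking for `pair[2]` on a two-element tuple raises an IndexError, which is why the lambda only uses x[0] and x[1].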

Now when you call sortByKey, these tuples are sorted by key, which, as mentioned above, is the first element of each tuple. So the RDD is sorted by word count.

wordCountsSorted = reversed_tup.sortByKey()
wordCountsSorted.collect()
# [(1, 'This'), (1, 'random'), (1, 'world'), (1, 'word'), (1, 'sample,'), (1, 'that'), (1, 'it'), (2, 'is'), (2, 'sample.'), (2, 'And'), (2, 'count'), (3, 'a')]
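The swap-then-sort step can also be sketched locally with Python's built-in `sorted`, using a shortened, made-up sample of the counts. This only mirrors the logic; it is not how Spark sorts data across partitions.

```python
# Shortened sample in the same (word, count) shape as wordCounts2
counts = [("This", 1), ("is", 2), ("a", 3), ("random", 1)]

# Same swap as map(lambda x: (x[1], x[0]))
swapped = [(count, word) for word, count in counts]

# Sort on the first element only, like sorting by key
ascending = sorted(swapped, key=lambda pair: pair[0])
descending = sorted(swapped, key=lambda pair: pair[0], reverse=True)

print(ascending)   # [(1, 'This'), (1, 'random'), (2, 'is'), (3, 'a')]
print(descending)  # [(3, 'a'), (2, 'is'), (1, 'This'), (1, 'random')]
```

In PySpark, the most frequent words can be listed first with `sortByKey(ascending=False)`. Python's `sorted` keeps tied entries in their original order, but it is safer not to rely on any particular order among words with the same count in Spark.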

Tags: python, lambda, apache-spark, pyspark

kikee1222
