Understanding lambda function inputs in Spark for RDDs

I am doing a course on Spark and I am a bit confused.

So there is the below code. I understand that line 1 is creating tuples (word, 1). Then line 2 is grouping by word and summing the count.

What I don't understand is what x and y are in line 2. We only have one numeric input to the lambda function, which is the count column (all the 1s) from wordCounts, so why is there a y?

wordCounts = words.map(lambda x: (x, 1)) #outputs [('self', 1), ('employment', 1), ('building', 1)...
wordCounts2 = wordCounts.reduceByKey(lambda x, y: x + y) # outputs [('self', 111), ('an', 178), ('internet', 26)

Then we have this piece of code, which comes directly after. I understand that it sorts the RDD. To confirm my understanding: is x[1] the word and x[2] the total count? I would guess so, but I am not 100% sure.

Sorry for the stupid questions but I couldn't find a clear explanation!

wordCountsSorted = wordCounts2.map(lambda x: (x[1], x[0])).sortByKey()
kikee1222 asked Oct 29 '25 01:10


1 Answer

1. Why x and y?

The map step makes a key-value pair like (word, 1) for every word.

The key is the word and the value is 1.

When you call reduceByKey, it combines all the values that share the same key.

reduceByKey(lambda x, y: x + y) groups the RDD elements by key (the first element of each tuple, the word) and sums the values. The lambda never sees the key: x and y are two values that belong to the same key. A reduce function takes two arguments because it combines values pairwise: x is the result accumulated so far (or a partial sum from another partition) and y is the next value for that word. Spark keeps calling it until a single value is left per key. The result might look something like:

# [('This', 1), ('is', 2), ('a', 3), ('random', 1), ('sample.', 2), ('And', 2), ('world', 1), ('count', 2), ('word', 1), ('sample,', 1), ('that', 1), ('it', 1)]
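
To make the roles of x and y concrete, here is a minimal plain-Python sketch of the idea. It is not Spark code: reduce_by_key is a hypothetical helper written only for illustration, and it runs on a single machine in a fixed order. Real Spark may combine values in a different order and across partitions, which is why the function passed to reduceByKey should be associative and commutative.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Plain-Python stand-in for RDD.reduceByKey (hypothetical helper, not Spark)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)  # the key is used only for grouping
    # func only ever receives two *values* that share a key
    return {key: reduce(func, values) for key, values in grouped.items()}


pairs = [("self", 1), ("an", 1), ("self", 1), ("self", 1)]
calls = []  # records every (x, y) the function is called with


def add(x, y):
    calls.append((x, y))
    return x + y


counts = reduce_by_key(pairs, add)
print(counts)  # {'self': 3, 'an': 1}
print(calls)   # [(1, 1), (2, 1)]
```

For 'self', the first call adds the first two 1s, and the second call adds the running total 2 to the remaining 1. For 'an' there is only one value, so the function is never called.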

2. Let's break down your next question about wordCountsSorted = wordCounts2.map(lambda x: (x[1], x[0])).sortByKey()

The following line swaps the elements of each tuple: the element at position 0 moves to position 1, and the element at position 1 moves to position 0. Python tuples are zero-indexed, so in wordCounts2 x[0] is the word and x[1] is the count; there is no x[2].

reversed_tup = wordCounts2.map(lambda x: (x[1], x[0]))

The output will look like:

# [(1, 'This'), (2, 'is'), (3, 'a'), (1, 'random'), (2, 'sample.'), (2, 'And'), (1, 'world'), (2, 'count'), (1, 'word'), (1, 'sample,'), (1, 'that'), (1, 'it')]
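
The indexing in that lambda can be checked on a single tuple in plain Python, with no Spark involved. The sample pair below is taken from the question's output; the variable names are only illustrative.

```python
pair = ("self", 111)  # (word, count), the shape of each element in wordCounts2

word = pair[0]   # first element: the word
count = pair[1]  # second element: the total count

swapped = (pair[1], pair[0])  # what the lambda returns for this element
print(swapped)  # (111, 'self')

# A two-element tuple has no index 2
try:
    pair[2]
    has_third = True
except IndexError:
    has_third = False
print(has_third)  # False
```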

Now when you call sortByKey, the tuples are sorted by their key, which, as mentioned above, is the first element of the tuple. After the swap that is the count, so the RDD is sorted by word count (ascending by default).

wordCountsSorted = reversed_tup.sortByKey()
wordCountsSorted.collect()
# [(1, 'This'), (1, 'random'), (1, 'world'), (1, 'word'), (1, 'sample,'), (1, 'that'), (1, 'it'), (2, 'is'), (2, 'sample.'), (2, 'And'), (2, 'count'), (3, 'a')]
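
The same swap-then-sort idea can be sketched on an ordinary Python list, as a rough local analogy rather than what Spark does internally. The sample data is a shortened version of the list above. Sorting on kv[0] alone mirrors the fact that sortByKey compares only the key; in Spark, sortByKey(ascending=False) should give the descending order.

```python
word_counts = [("This", 1), ("is", 2), ("a", 3), ("random", 1)]

# Swap each (word, count) into (count, word) so the count becomes the key
swapped = [(count, word) for word, count in word_counts]

# Sort by the key only (the first element), like sortByKey
ascending = sorted(swapped, key=lambda kv: kv[0])
descending = sorted(swapped, key=lambda kv: kv[0], reverse=True)

print(ascending)   # [(1, 'This'), (1, 'random'), (2, 'is'), (3, 'a')]
print(descending)  # [(3, 'a'), (2, 'is'), (1, 'This'), (1, 'random')]
```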
pissall answered Oct 31 '25 16:10