I am doing a course on Spark and I am a bit confused.
So there is the below code. I understand that line 1 is creating tuples (word, 1). Then line 2 is grouping by word and summing the count.
What I don't understand is what x and y are in line 2. We only have one numeric input to the lambda function, which is the count column (all the 1's) from wordCounts, so why y?
wordCounts = words.map(lambda x: (x, 1)) #outputs [('self', 1), ('employment', 1), ('building', 1)...
wordCounts2 = wordCounts.reduceByKey(lambda x, y: x + y) # outputs [('self', 111), ('an', 178), ('internet', 26)
Then we have this piece of code, which comes directly after. I understand that it sorts the RDD. To confirm my understanding: is x[1] the word and x[2] the total count? I would guess so, but I am not 100% sure.
Sorry for the stupid questions but I couldn't find a clear explanation!
wordCountsSorted = wordCounts2.map(lambda x: (x[1], x[0])).sortByKey()
First, make a key-value pair like (word, 1).
Now your key is the word and your value is 1.
When you call reduceByKey, it adds up all the values that share the same key.
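reduceByKey(lambda x, y: x + y) groups the RDD elements by key, which is the first element of each tuple (the word), and sums up the values. The lambda never sees the key: it is called repeatedly with two values that belong to the same key. In this statement, x is the value accumulated so far for that key and y is the next value for the same key; the result x + y becomes the new x for the following call. For example, for ('self', 1), ('self', 1), ('self', 1) it computes 1 + 1 = 2 and then 2 + 1 = 3. Because Spark combines values within each partition first and then merges the partial results, both x and y can be partial sums, which is why the function has to be associative and commutative.
The result might look something like: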
# [('This', 1), ('is', 2), ('a', 3), ('random', 1), ('sample.', 2), ('And', 2), ('world', 1), ('count', 2), ('word', 1), ('sample,', 1), ('that', 1), ('it', 1)]
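To make the roles of x and y concrete, here is a minimal plain-Python sketch of the same idea. It is only an illustration, not how Spark implements reduceByKey; the helper name reduce_by_key and the sample data are made up for this example.

```python
def reduce_by_key(pairs, func):
    """Combine the values of pairs that share a key, two values at a time."""
    totals = {}
    for key, value in pairs:
        if key in totals:
            # x = value accumulated so far for this key, y = the next value
            totals[key] = func(totals[key], value)
        else:
            # first time this key is seen: nothing to combine yet
            totals[key] = value
    return list(totals.items())


# hypothetical sample data in the same (word, 1) shape as wordCounts
word_counts = [("self", 1), ("an", 1), ("self", 1), ("self", 1), ("an", 1)]
word_counts2 = reduce_by_key(word_counts, lambda x, y: x + y)
print(word_counts2)  # [('self', 3), ('an', 2)]
```

For the key 'self' the lambda is called twice: first with x=1, y=1, then with x=2, y=1.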
wordCountsSorted = wordCounts2.map(lambda x: (x[1], x[0])).sortByKey()
The map in this line swaps the elements of each tuple: the element at position 0 moves to position 1, and the element at position 1 moves to position 0. Python indexing starts at 0, so in each (word, count) tuple x[0] is the word and x[1] is the total count; there is no x[2].
reversed_tup = wordCounts2.map(lambda x: (x[1], x[0]))
The output will look like:
# [(1, 'This'), (2, 'is'), (3, 'a'), (1, 'random'), (2, 'sample.'), (2, 'And'), (1, 'world'), (2, 'count'), (1, 'word'), (1, 'sample,'), (1, 'that'), (1, 'it')]
Now when you call sortByKey, these tuples are sorted by their key, which, as mentioned above, is the first element of the tuple. So the RDD is sorted by the count of the words, in ascending order by default.
wordCountsSorted = reversed_tup.sortByKey()
wordCountsSorted.collect()
# [(1, 'This'), (1, 'random'), (1, 'world'), (1, 'word'), (1, 'sample,'), (1, 'that'), (1, 'it'), (2, 'is'), (2, 'sample.'), (2, 'And'), (2, 'count'), (3, 'a')]
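The swap-then-sort step can also be tried in plain Python, without Spark. This is only a sketch under the assumption of a small made-up list; the snake_case variable names are illustrative, and Python's sorted stands in for sortByKey.

```python
# hypothetical (word, count) pairs, in the same shape as wordCounts2
word_counts2 = [("This", 1), ("is", 2), ("a", 3), ("random", 1)]

# x[0] is the word and x[1] is the count, so this puts the count first
reversed_tup = [(x[1], x[0]) for x in word_counts2]

# sort on the first element of each tuple (the count), as sortByKey would
word_counts_sorted = sorted(reversed_tup, key=lambda pair: pair[0])
print(word_counts_sorted)  # [(1, 'This'), (1, 'random'), (2, 'is'), (3, 'a')]
```

In PySpark itself, passing ascending=False to sortByKey should give the most frequent words first.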