I have a Spark RDD (words) which consists of arrays of texts. For example,
words.take(3)
would return something like:
[ ["A", "B"], ["B", "C"], ["C", "A", "D"] ]
Now, I want to find the total number of texts as well as the number of unique texts. If the RDD had only the above 3 records:
total_words = 7
unique_words = 4 (only A, B, C, D)
Now, in order to get the total, I did something like this:
text_count_rdd = words.map(lambda x: len(x))
text_count_rdd.sum()
But I'm stuck on how to retrieve the unique count.
Just flatMap, take distinct, and count:
words.flatMap(set).distinct().count()
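For reference, here is a minimal end-to-end sketch (assuming a local SparkContext named sc and the sample data from the question) that computes both counts. Note that flatMap(set) deduplicates within each array before the global distinct(); flatMap(lambda x: x) would give the same final result.

from pyspark import SparkContext

sc = SparkContext("local", "word-counts")  # assumed local context for illustration

words = sc.parallelize([["A", "B"], ["B", "C"], ["C", "A", "D"]])

# Total number of texts: sum of the lengths of each array
total_words = words.map(len).sum()                     # 7

# Unique number of texts: flatten, deduplicate, count
unique_words = words.flatMap(set).distinct().count()   # 4

print(total_words, unique_words)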