
Counting distinct texts in a Spark RDD with array objects

I have a Spark RDD (words) whose records are arrays of texts. For example,

words.take(3)

would return something like:

[ ["A", "B"], ["B", "C"], ["C", "A", "D"] ]

Now I want to find both the total number of texts and the number of unique texts. If the RDD contained only the 3 records above, the result would be:

total_words = 7
unique_words = 4 (only A, B, C, D)

To get the total, I did something like this:

text_count_rdd = words.map(lambda x: len(x))
text_count_rdd.sum()

But I'm stuck on how to retrieve the unique count.

rclakmal asked Oct 30 '22 09:10



1 Answer

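Flatten the arrays with flatMap, take the distinct values, and count them. Passing set to flatMap also drops duplicates within each array before the distinct step: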

words.flatMap(set).distinct().count()
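
To sanity-check the expected numbers without a Spark session, the same steps can be mirrored in plain Python. This is only a local sketch: the `records` list is an assumed stand-in for the `words` RDD, filled with the three sample records from the question, and it does not exercise Spark itself.

```python
from itertools import chain

# Sample records copied from the question; a stand-in for the rows of `words`.
records = [["A", "B"], ["B", "C"], ["C", "A", "D"]]

# Mirrors words.map(len).sum(): total number of texts across all records.
total_words = sum(len(record) for record in records)

# Mirrors words.flatMap(set).distinct().count():
# set(record) removes duplicates inside one record, chain.from_iterable
# flattens the records, and the outer set() keeps each text once overall.
unique_words = len(set(chain.from_iterable(set(record) for record in records)))

print(total_words, unique_words)
```

For the sample data this gives 7 total texts and 4 unique ones, matching the numbers in the question.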
zero323 answered Nov 13 '22 15:11
