Counting distinct texts in a Spark RDD with array objects

Question

I have a Spark RDD (words) whose records are arrays of texts. For example,

words.take(3)

would return something like:

[ ["A", "B"], ["B", "C"], ["C", "A", "D"] ]

Now I want to find both the total number of texts and the number of unique texts. If the RDD contained only the 3 records above, then:

total_words = 7
unique_words = 4 (only A, B, C, D)

To get the total, I did something like this:

text_count_rdd = words.map(lambda x: len(x))
text_count_rdd.sum()

But I'm stuck on how to retrieve the unique count.

zero323 · Accepted Answer

Just flatMap, take distinct and count:

words.flatMap(set).distinct().count()
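
To see what that chain computes, here is a minimal plain-Python sketch of the same logic on the sample data from the question. It does not use Spark at all; the names records and flattened are made up for illustration, and the list stands in for the contents of the words RDD.

```python
from itertools import chain

# Sample data from the question; stands in for the contents of the `words` RDD.
records = [["A", "B"], ["B", "C"], ["C", "A", "D"]]

# Mirrors flatMap(set): turn each record into a set, then concatenate the results.
flattened = list(chain.from_iterable(set(r) for r in records))

# Mirrors distinct().count(): drop duplicates across records and count the rest.
unique_words = len(set(flattened))

# Mirrors map(len).sum() from the question: total texts, duplicates included.
total_words = sum(len(r) for r in records)

print(total_words, unique_words)  # prints "7 4"
```

Passing set to flatMap only removes duplicates inside each record before distinct runs, so flatMap(lambda x: x) should give the same final count; set just means fewer elements reach distinct when a record repeats a text.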
