How to count the number of occurrences using PySpark

I'm trying to use PySpark to count the number of occurrences of each element.

Suppose I have data like this:

data = sc.parallelize([(1,[u'a',u'b',u'd']),
                       (2,[u'a',u'c',u'd']),
                       (3,[u'a']) ])

count = sc.parallelize([(u'a',0),(u'b',0),(u'c',0),(u'd',0)])

Is it possible to count the occurrences in data and update the values in count accordingly?

The result should be like [(u'a',3),(u'b',1),(u'c',1),(u'd',2)].


2 Answers

I would use Counter:

>>> from collections import Counter
>>>
>>> data.values().map(Counter).reduce(lambda x, y: x + y)
Counter({'a': 3, 'b': 1, 'c': 1, 'd': 2})
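The reduced Counter is an ordinary local Python object on the driver, so if you want the [(key, count), ...] list from the question, you can convert it directly (a small follow-up sketch):

>>> result = data.values().map(Counter).reduce(lambda x, y: x + y)
>>> sorted(result.items())
[('a', 3), ('b', 1), ('c', 1), ('d', 2)]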

RDDs are immutable and thus cannot be updated. Instead, compute the counts directly from your data:

count = (data
         .flatMap(lambda kv: kv[1])        # drop the key, keep the list of items
         .map(lambda w: (w, 1))            # pair each item with a count of 1
         .reduceByKey(lambda a, b: a + b)) # sum the counts per item

Then, if the result fits in the driver's main memory, feel free to call .collect() on count.
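If you also want to keep the zeros from the pre-seeded count RDD for keys that never occur in data, one option is a leftOuterJoin against the computed counts. A minimal sketch, assuming count still holds the (key, 0) pairs from the question:

counted = (data
           .flatMap(lambda kv: kv[1])
           .map(lambda w: (w, 1))
           .reduceByKey(lambda a, b: a + b))

result = (count
          .leftOuterJoin(counted)          # (key, (0, n)) or (key, (0, None))
          .mapValues(lambda v: v[1] if v[1] is not None else v[0]))

sorted(result.collect())                   # [('a', 3), ('b', 1), ('c', 1), ('d', 2)]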
