 

Pyspark: groupby and then count true values

My data structure is in JSON format:

"header"{"studentId":"1234","time":"2016-06-23","homeworkSubmitted":True}
"header"{"studentId":"1234","time":"2016-06-24","homeworkSubmitted":True}
"header"{"studentId":"1234","time":"2016-06-25","homeworkSubmitted":True}
"header"{"studentId":"1236","time":"2016-06-23","homeworkSubmitted":False}
"header"{"studentId":"1236","time":"2016-06-24","homeworkSubmitted":True}
....

I need to plot a histogram that shows the number of homeworkSubmitted: True values per studentId. I wrote code that flattens the data structure, so my keys are header.studentId, header.time and header.homeworkSubmitted.

I used keyBy to group by studentId:

    from collections import Counter

    (initialRDD.keyBy(lambda row: row['header.studentId'])
               .map(lambda kv: (kv[0], kv[1]['header.homeworkSubmitted']))
               .map(mapTF)  # mapTF maps True/False to 1/0
               .groupByKey().mapValues(Counter).collect())

This gives me a result like this:

("1234", Counter({0:0, 1:3}),
("1236", Counter(0:1, 1:1))

I need only the counts of 1s, possibly mapped to a list, so that I can plot a histogram using matplotlib. I am not sure how to proceed with the filtering.

Edit: in the end I iterated through the dictionary, added the counts to a list, and then plotted a histogram of that list. I am wondering if there is a more elegant way to do the whole process I outlined in my code.
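For reference, a more elegant variant at the RDD level would be to pull the count of 1s out of each Counter before collecting, so the driver receives a plain list that can go straight into matplotlib. This is only a sketch built on the question's own initialRDD and mapTF helper (assumed to map True/False to 1/0):

    from collections import Counter
    import matplotlib.pyplot as plt

    # Count of True (mapped to 1 by mapTF) per student, collected as a list.
    true_counts = (initialRDD
        .keyBy(lambda row: row['header.studentId'])
        .map(lambda kv: (kv[0], kv[1]['header.homeworkSubmitted']))
        .map(mapTF)                                # assumed: True/False -> 1/0
        .groupByKey()
        .mapValues(lambda vals: Counter(vals)[1])  # keep only the count of 1s
        .values()
        .collect())

    plt.hist(true_counts)
    plt.xlabel('homework submissions per student')
    plt.ylabel('number of students')
    plt.show()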

asked Jun 24 '16 by Anastasia



2 Answers

    df = sqlContext.read.json('/path/to/your/dataset/')
    df.filter(df.homeworkSubmitted == True).groupBy(df.studentId).count()

Note that your input is not valid JSON as shown: the stray "header" key and the Python-style True/False (instead of lowercase true/false) would have to be fixed before read.json can parse it.
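To go from that grouped count to a histogram, one option is to collect the (small) aggregated result to pandas and hand it to matplotlib; the column name count comes from the count() aggregation. A sketch, assuming the input has been repaired into valid JSON with studentId and homeworkSubmitted at the top level:

    import matplotlib.pyplot as plt

    df = sqlContext.read.json('/path/to/your/dataset/')
    counts = (df.filter(df.homeworkSubmitted == True)
                .groupBy(df.studentId)
                .count()      # adds a 'count' column with one row per student
                .toPandas())

    plt.hist(counts['count'])
    plt.show()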

answered Nov 15 '22 by shuaiyuancn


I don't have Spark in front of me right now, though I can edit this tomorrow when I do.

But if I'm understanding this correctly, you have an RDD of rows with three key-value fields, and you need to filter by homeworkSubmitted=True. I would turn this into a DataFrame, then use:

    df.where(df.homeworkSubmitted == True).count()

You could then use group by operations if you wanted to explore subsets based on the other columns.
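If the data is still an RDD of flattened dicts rather than a DataFrame, the conversion step might look like the following sketch; the field names are taken from the question's flattened keys, and sqlContext is assumed to be available as in the other answer:

    from pyspark.sql import Row

    # Build a DataFrame from the flattened RDD rows.
    rows = initialRDD.map(lambda r: Row(
        studentId=r['header.studentId'],
        time=r['header.time'],
        homeworkSubmitted=r['header.homeworkSubmitted']))
    df = sqlContext.createDataFrame(rows)

    df.where(df.homeworkSubmitted == True).count()    # total submissions
    df.where(df.homeworkSubmitted == True) \
      .groupBy('studentId').count().show()            # per-student counts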

answered Nov 15 '22 by Jeff