My data structure is in JSON format:
"header"{"studentId":"1234","time":"2016-06-23","homeworkSubmitted":True}
"header"{"studentId":"1234","time":"2016-06-24","homeworkSubmitted":True}
"header"{"studentId":"1234","time":"2016-06-25","homeworkSubmitted":True}
"header"{"studentId":"1236","time":"2016-06-23","homeworkSubmitted":False}
"header"{"studentId":"1236","time":"2016-06-24","homeworkSubmitted":True}
....
I need to plot a histogram that shows the number of homeworkSubmitted: True records over all studentIds. I wrote code that flattens the data structure, so my keys are header.studentId, header.time and header.homeworkSubmitted.
I used keyBy to group by studentId:
(initialRDD.keyBy(lambda row: row['header.studentId'])
 .map(lambda kv: (kv[0], kv[1]['header.homeworkSubmitted']))  # keep (studentId, submitted) pairs
 .map(mapTF)                                                  # mapTF is my helper that turns True/False into 1/0
 .groupByKey()
 .mapValues(Counter)
 .collect())
This gives me a result like this:
("1234", Counter({0: 0, 1: 3})),
("1236", Counter({0: 1, 1: 1}))
I need only the counts of 1s, ideally mapped to a list, so that I can plot a histogram using matplotlib. I am not sure how to proceed and filter everything.
Edit: in the end I iterated through the dictionary, added the counts to a list, and then plotted a histogram of that list. I am wondering if there is a more elegant way to do the whole process I outlined in my code.
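For what it's worth, the manual loop can be replaced with a single list comprehension over the collected (studentId, Counter) pairs. A minimal sketch, using the example output shown above (the variable name result and the axis labels are mine):

import matplotlib.pyplot as plt
from collections import Counter

# the list returned by .collect() above, e.g.:
result = [("1234", Counter({0: 0, 1: 3})), ("1236", Counter({0: 1, 1: 1}))]

# counter[1] is the count of 1s, i.e. the number of True submissions per student
true_counts = [counter[1] for _, counter in result]

plt.hist(true_counts)
plt.xlabel('homeworks submitted')
plt.ylabel('number of students')
plt.show()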
PySpark groupBy count is used to get the number of records for each group. To perform the count, you first call groupBy() on the DataFrame, which groups the records based on one or more columns, and then call count() to get the number of records in each group.
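Applied to the student data, that pattern looks roughly like this (a sketch, assuming the DataFrame df loaded in the answer below):

# number of records per student, whether or not the homework was submitted
df.groupBy('studentId').count().show()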
In PySpark, there are two ways to get the count of distinct values: you can use the distinct() and count() functions of the DataFrame, or the countDistinct() SQL function, which returns the distinct count over all the selected columns.
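For example, counting how many distinct students appear in the data could look like this with either approach (a sketch, again assuming the same df):

from pyspark.sql.functions import countDistinct

df.select('studentId').distinct().count()      # approach 1: distinct() followed by count()
df.select(countDistinct('studentId')).show()   # approach 2: the countDistinct() aggregate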
The withColumn() function is a widely used DataFrame transformation that changes the value of a column, converts the datatype of an existing column, or creates a new column.
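For instance, withColumn() could be used here to turn the boolean homeworkSubmitted flag into a 0/1 integer before aggregating; a sketch (the column name submittedInt is mine):

from pyspark.sql.functions import col

# add an integer 0/1 column derived from the boolean flag
df = df.withColumn('submittedInt', col('homeworkSubmitted').cast('int'))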
# load the JSON records into a DataFrame
df = sqlContext.read.json('/path/to/your/dataset/')
# keep only the submitted homeworks, then count them per student
df.filter(df.homeworkSubmitted == True).groupby(df.studentId).count()
Note it is not valid JSON if there is a "header", or True instead of true.
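To get from there to the histogram in the question, one rough sketch (untested, variable names mine) is to collect the small aggregated result and hand the per-student counts to matplotlib:

import matplotlib.pyplot as plt

counts = df.filter(df.homeworkSubmitted == True).groupby(df.studentId).count().collect()
plt.hist([row['count'] for row in counts])   # one value per student: number of True submissions
plt.show()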
I don't have Spark in front of me right now, though I can edit this tomorrow when I do.
But if I'm understanding this correctly, you have an RDD with three key-value fields and need to filter by homeworkSubmitted=True. I would think you turn this into a DataFrame, then use:
df.where(df.homeworkSubmitted==True).count()
You could then use group by operations if you wanted to explore subsets based on the other columns.
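For example (a sketch, reusing the column names above), to see how many homeworks were submitted on each date:

df.where(df.homeworkSubmitted == True).groupBy('time').count().show()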