 

Pyspark: groupby and then count true values

My data structure is in JSON format:

"header"{"studentId":"1234","time":"2016-06-23","homeworkSubmitted":True}
"header"{"studentId":"1234","time":"2016-06-24","homeworkSubmitted":True}
"header"{"studentId":"1234","time":"2016-06-25","homeworkSubmitted":True}
"header"{"studentId":"1236","time":"2016-06-23","homeworkSubmitted":False}
"header"{"studentId":"1236","time":"2016-06-24","homeworkSubmitted":True}
....

I need to plot a histogram that shows the number of homeworkSubmitted: True values per studentId. I wrote code that flattens the data structure, so my keys are header.studentId, header.time and header.homeworkSubmitted.

I used keyBy to group by studentId:

    from collections import Counter

    (initialRDD.keyBy(lambda row: row['header.studentId'])
               .map(lambda kv: (kv[0], kv[1]['header.homeworkSubmitted']))
               .map(mapTF)  # mapTF maps True/False to 1/0
               .groupByKey().mapValues(Counter).collect())

This gives me a result like this:

("1234", Counter({0:0, 1:3}),
("1236", Counter(0:1, 1:1))

I need only the counts of 1s, possibly mapped to a list, so that I can plot a histogram using matplotlib. I am not sure how to proceed with the filtering.

Edit: in the end I iterated through the dictionary, added the counts to a list, and then plotted a histogram of that list. I am wondering if there is a more elegant way to do the whole process I outlined in my code.
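For reference, a more elegant variant at the RDD level would be to pull the count of 1s out of each Counter before collecting, so the driver receives a plain list that can go straight into matplotlib. This is only a sketch built on the question's own initialRDD and mapTF helper (assumed to map True/False to 1/0):

    from collections import Counter
    import matplotlib.pyplot as plt

    # Count of True (mapped to 1 by mapTF) per student, collected as a list.
    true_counts = (initialRDD
        .keyBy(lambda row: row['header.studentId'])
        .map(lambda kv: (kv[0], kv[1]['header.homeworkSubmitted']))
        .map(mapTF)                                # assumed: True/False -> 1/0
        .groupByKey()
        .mapValues(lambda vals: Counter(vals)[1])  # keep only the count of 1s
        .values()
        .collect())

    plt.hist(true_counts)
    plt.xlabel('homework submissions per student')
    plt.ylabel('number of students')
    plt.show()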

asked Jun 24 '16 by Anastasia



2 Answers

    df = sqlContext.read.json('/path/to/your/dataset/')
    df.filter(df.homeworkSubmitted == True).groupBy(df.studentId).count()

Note that your input is not valid JSON as shown: the stray "header" key and the Python-style True/False (instead of lowercase true/false) would have to be fixed before read.json can parse it.
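To go from that grouped count to a histogram, one option is to collect the (small) aggregated result to pandas and hand it to matplotlib; the column name count comes from the count() aggregation. A sketch, assuming the input has been repaired into valid JSON with studentId and homeworkSubmitted at the top level:

    import matplotlib.pyplot as plt

    df = sqlContext.read.json('/path/to/your/dataset/')
    counts = (df.filter(df.homeworkSubmitted == True)
                .groupBy(df.studentId)
                .count()      # adds a 'count' column with one row per student
                .toPandas())

    plt.hist(counts['count'])
    plt.show()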

answered Nov 15 '22 by shuaiyuancn


I don't have Spark in front of me right now, though I can edit this tomorrow when I do.

But if I'm understanding this correctly, you have an RDD of rows with three key-value fields, and you need to filter by homeworkSubmitted=True. I would turn this into a DataFrame, then use:

    df.where(df.homeworkSubmitted == True).count()

You could then use group by operations if you wanted to explore subsets based on the other columns.
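If the data is still an RDD of flattened dicts rather than a DataFrame, the conversion step might look like the following sketch; the field names are taken from the question's flattened keys, and sqlContext is assumed to be available as in the other answer:

    from pyspark.sql import Row

    # Build a DataFrame from the flattened RDD rows.
    rows = initialRDD.map(lambda r: Row(
        studentId=r['header.studentId'],
        time=r['header.time'],
        homeworkSubmitted=r['header.homeworkSubmitted']))
    df = sqlContext.createDataFrame(rows)

    df.where(df.homeworkSubmitted == True).count()    # total submissions
    df.where(df.homeworkSubmitted == True) \
      .groupBy('studentId').count().show()            # per-student counts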

answered Nov 15 '22 by Jeff