I have a difficulty when working with data frames in spark with Scala. If I have a data frame that I want to extract a column of unique entries, when I use groupBy
I don't get a data frame back.
For example, I have a DataFrame
called logs that has the following form:
machine_id | event | other_stuff
34131231 | thing | stuff
83423984 | notathing | notstuff
34131231 | thing | morestuff
and I would like the unique machine ids where event is thing stored in a new DataFrame
to allow me to do some filtering of some kind. Using
val machineId = logs
.where($"event" === "thing")
.select("machine_id")
.groupBy("machine_id")
I get a val of Grouped Data back which is a pain in the butt to use (or I don't know how to use this kind of object properly). Having got this list of unique machine id's, I then want to use this in filtering another DataFrame
to extract all events for individual machine ids.
I can see I'll want to do this kind of thing fairly regularly and the basic workflow is:
It's the first two steps I would appreciate some guidance with here.
I appreciate this example is kind of contrived but hopefully it explains what my issue is. It may be I don't know enough about GroupedData
objects or (as I'm hoping) I'm missing something in data frames that makes this easy. I'm using spark 1.5 built on Scala 2.10.4.
Thanks
One way to get all columns after doing a groupBy is to use join function. data_joined will now have all columns including the count values.
Similar to SQL “GROUP BY” clause, Spark sql groupBy() function is used to collect the identical data into groups on DataFrame/Dataset and perform aggregate functions like count(),min(),max,avg(),mean() on the grouped data.
Filter, groupBy and map are the examples of transformations. Actions: Actions refer to an operation which also applies on RDD, that instructs Spark to perform computation and send the result back to driver. This is an example of action.
PySpark Groupby Count is used to get the number of records for each group. So to perform the count, first, you need to perform the groupBy() on DataFrame which groups the records based on single or multiple column values, and then do the count() to get the number of records for each group.
Just use distinct
not groupBy
:
val machineId = logs.where($"event"==="thing").select("machine_id").distinct
Which will be equivalent to SQL:
SELECT DISTINCT machine_id FROM logs WHERE event = 'thing'
GroupedData
is not intended to be used directly. It provides a number of methods, where agg
is the most general, which can be used to apply different aggregate functions and convert it back to DataFrame
. In terms of SQL what you have after where
and groupBy
is equivalent to something like this
SELECT machine_id, ... FROM logs WHERE event = 'thing' GROUP BY machine_id
where ...
has to be provided by agg
or equivalent method.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With