 

Using groupBy in Spark and getting back to a DataFrame

I'm having difficulty working with DataFrames in Spark with Scala. If I have a DataFrame from which I want to extract a column of unique entries, using groupBy doesn't give me a DataFrame back.

For example, I have a DataFrame called logs that has the following form:

machine_id | event     | other_stuff
  34131231 | thing     | stuff
  83423984 | notathing | notstuff
  34131231 | thing     | morestuff

and I would like the unique machine ids where event is "thing", stored in a new DataFrame so I can do some filtering of some kind. Using

val machineId = logs
  .where($"event" === "thing")
  .select("machine_id")
  .groupBy("machine_id")

I get a GroupedData back, which is a pain in the butt to use (or I don't know how to use this kind of object properly). Having got this list of unique machine ids, I then want to use it to filter another DataFrame and extract all events for individual machine ids.

I can see I'll want to do this kind of thing fairly regularly and the basic workflow is:

  1. Extract unique ids from a log table.
  2. Use the unique ids to extract all events for a particular id.
  3. Run some kind of analysis on the data that has been extracted.

It's the first two steps I would appreciate some guidance with here.
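
To make the goal concrete: assuming step 1 gave me the unique ids in a DataFrame called machineId, step 2 would be a join against a second, hypothetical events DataFrame (the names here are just illustrative):

// events is a hypothetical second DataFrame with a machine_id column
val allEvents = events.join(machineId, "machine_id") // keep only rows for those machines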

I appreciate this example is kind of contrived, but hopefully it explains what my issue is. It may be that I don't know enough about GroupedData objects, or (as I'm hoping) I'm missing something in DataFrames that makes this easy. I'm using Spark 1.5 built on Scala 2.10.4.

Thanks

asked Nov 12 '15 by Dean


People also ask

How do you get all the columns after groupBy in PySpark?

One way to get all columns after doing a groupBy is to join the aggregated result back to the original DataFrame; the joined result then has all the original columns plus the aggregated values (such as the counts).
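
A minimal Scala sketch of that pattern (the df DataFrame and its machine_id column are assumed for illustration):

val counts = df.groupBy("machine_id").count()  // machine_id plus a count column
val dataJoined = df.join(counts, "machine_id") // all original columns plus count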

How does groupBy work in Spark?

Similar to the SQL GROUP BY clause, Spark SQL's groupBy() function is used to collect identical data into groups on a DataFrame/Dataset and perform aggregate functions such as count(), min(), max(), avg() and mean() on the grouped data.
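
For example (a sketch; it assumes logs also has a numeric duration column, which the question's table does not show):

import org.apache.spark.sql.functions.{count, avg, max}

val summary = logs
  .groupBy("machine_id")
  .agg(
    count("event").as("n_events"),       // rows per machine
    avg("duration").as("avg_duration"),  // assumed numeric column
    max("duration").as("max_duration"))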

Is groupBy an action in Spark?

No — groupBy is a transformation, not an action. filter, groupBy and map are examples of transformations. Actions are operations that instruct Spark to actually perform the computation and send the result back to the driver; count() and collect() are examples of actions.
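
A quick way to see the difference (nothing runs until the action on the last line):

val things = logs.where($"event" === "thing") // transformation: builds a plan, executes nothing
val n = things.count()                        // action: runs the job and returns a Long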

How do you use groupBy and count in PySpark?

PySpark's groupBy count is used to get the number of records for each group. To perform the count, first call groupBy() on the DataFrame, which groups the records based on one or more column values, and then call count() to get the number of records for each group.
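
The same pattern in Scala, the language used in the question:

val perMachine = logs
  .groupBy("machine_id")
  .count() // GroupedData.count returns a DataFrame (still lazy), one row per machine_id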


1 Answer

Just use distinct, not groupBy:

val machineId = logs.where($"event" === "thing").select("machine_id").distinct

Which will be equivalent to SQL:

SELECT DISTINCT machine_id FROM logs WHERE event = 'thing'
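
For the second step in the question — pulling all events for those machines out of another DataFrame — one option is a left semi join against the distinct ids (events is a hypothetical second DataFrame):

val matching = events.join(
  machineId,
  events("machine_id") === machineId("machine_id"),
  "leftsemi") // keeps only events' columns, for machine_ids present in machineId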

GroupedData is not intended to be used directly. It provides a number of methods, of which agg is the most general; these can be used to apply different aggregate functions and convert the result back to a DataFrame. In SQL terms, what you have after where and groupBy is equivalent to something like this:

SELECT machine_id, ... FROM logs WHERE event = 'thing' GROUP BY machine_id

where ... has to be provided by agg or equivalent method.
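
So if you do want per-group aggregates rather than just the distinct ids, a minimal sketch (the aggregate chosen here is just an example):

import org.apache.spark.sql.functions.count

val eventsPerMachine = logs
  .where($"event" === "thing")
  .groupBy("machine_id")
  .agg(count("event").as("n_events")) // agg turns the GroupedData back into a DataFrame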

answered Oct 20 '22 by zero323