Using groupBy in Spark and getting back to a DataFrame

Tags: scala, apache-spark, apache-spark-sql

I'm having difficulty working with data frames in Spark with Scala. If I have a data frame from which I want to extract a column of unique entries, I don't get a data frame back when I use groupBy.

For example, I have a DataFrame called logs that has the following form:

machine_id | event     | other_stuff
34131231   | thing     | stuff
83423984   | notathing | notstuff
34131231   | thing     | morestuff

and I would like the unique machine ids where event is thing, stored in a new DataFrame so that I can use them for filtering. Using

val machineId = logs
  .where($"event" === "thing")
  .select("machine_id")
  .groupBy("machine_id")

I get a val of type GroupedData back, which is a pain to use (or I don't know how to use this kind of object properly). Having got this list of unique machine ids, I then want to use it to filter another DataFrame and extract all events for individual machine ids.

I can see I'll want to do this kind of thing fairly regularly and the basic workflow is:

Extract unique ids from a log table.
Use the unique ids to extract all events for a particular id.
Run some kind of analysis on the extracted data.

It's the first two steps I would appreciate some guidance with here.

I appreciate this example is kind of contrived, but hopefully it explains what my issue is. It may be that I don't know enough about GroupedData objects or (as I'm hoping) I'm missing something in data frames that makes this easy. I'm using Spark 1.5 built on Scala 2.10.4.

Thanks


asked Nov 12 '15 11:11

Dean

1 Answer

Just use distinct, not groupBy:

val machineId = logs
  .where($"event" === "thing")
  .select("machine_id")
  .distinct

Which will be equivalent to SQL:

SELECT DISTINCT machine_id FROM logs WHERE event = 'thing'
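As a rough sketch of how the distinct result might then be used for the second step in the question (extracting all events for those ids from another DataFrame), one option is a join on the shared column. This is an illustration written against the Spark 1.5 API mentioned in the question, not part of the original answer; the object name, the helper names, the `allEvents` DataFrame and its sample rows are all hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper object; names are illustrative only.
object DistinctIdsSketch {

  // Unique machine ids for which the given event was logged.
  def uniqueMachineIds(logs: DataFrame, event: String): DataFrame =
    logs.where(logs("event") === event).select("machine_id").distinct

  // Rows of `events` whose machine_id appears in `ids`.
  // Inner join on the shared column; because `ids` holds each id once,
  // no row of `events` is duplicated by the join.
  def eventsFor(events: DataFrame, ids: DataFrame): DataFrame =
    events.join(ids, "machine_id")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("distinct-ids-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // The sample rows from the question.
    val logs = sc.parallelize(Seq(
      (34131231L, "thing", "stuff"),
      (83423984L, "notathing", "notstuff"),
      (34131231L, "thing", "morestuff")
    )).toDF("machine_id", "event", "other_stuff")

    // Assumed second table; not part of the original question.
    val allEvents = sc.parallelize(Seq(
      (34131231L, "boot"),
      (83423984L, "boot"),
      (34131231L, "shutdown")
    )).toDF("machine_id", "event")

    val machineId = uniqueMachineIds(logs, "thing")
    machineId.show()

    eventsFor(allEvents, machineId).show()

    sc.stop()
  }
}
```

With the sample rows above, machineId contains the single id 34131231, so the join keeps only the two allEvents rows for that machine.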

GroupedData is not intended to be used directly. It provides a number of methods, of which agg is the most general, that apply aggregate functions and convert the result back to a DataFrame. In SQL terms, what you have after where and groupBy is equivalent to something like this:

SELECT machine_id, ... FROM logs WHERE event = 'thing' GROUP BY machine_id

where ... has to be provided by agg or an equivalent method.
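To make that concrete, here is a minimal sketch of filling in the ... with agg so that groupBy hands back a DataFrame. It assumes the Spark 1.5 API and the sample rows from the question; the object name, the helper name and the n_events alias are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.count

// Hypothetical helper object; names are illustrative only.
object GroupByAggSketch {

  // Number of "thing" events per machine. agg turns the GroupedData
  // produced by groupBy back into a DataFrame with the columns
  // machine_id and n_events.
  def thingCounts(logs: DataFrame): DataFrame =
    logs
      .where(logs("event") === "thing")
      .groupBy("machine_id")
      .agg(count("event").as("n_events"))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("groupby-agg-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // The sample rows from the question.
    val logs = sc.parallelize(Seq(
      (34131231L, "thing", "stuff"),
      (83423984L, "notathing", "notstuff"),
      (34131231L, "thing", "morestuff")
    )).toDF("machine_id", "event", "other_stuff")

    thingCounts(logs).show()

    // Shortcut for the same aggregation: GroupedData.count() also returns
    // a DataFrame, with the result in a column named "count".
    logs.where($"event" === "thing").groupBy("machine_id").count().show()

    sc.stop()
  }
}
```

This roughly corresponds to SELECT machine_id, COUNT(event) AS n_events FROM logs WHERE event = 'thing' GROUP BY machine_id. If only the ids are needed, distinct remains the simpler choice.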


answered Oct 20 '22 17:10

zero323
