Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark: Is "count" on Grouped Data a Transformation or an Action?

I know that count called on an RDD or a DataFrame is an action. But while fiddling with the spark shell, I observed the following

scala> val empDF = Seq((1,"James Gordon", 30, "Homicide"),(2,"Harvey Bullock", 35, "Homicide"),(3,"Kristen Kringle", 28, "Records"),(4,"Edward Nygma", 30, "Forensics"),(5,"Leslie Thompkins", 31, "Forensics")).toDF("id", "name", "age", "department")
empDF: org.apache.spark.sql.DataFrame = [id: int, name: string, age: int, department: string]

scala> empDF.show
+---+----------------+---+----------+
| id|            name|age|department|
+---+----------------+---+----------+
|  1|    James Gordon| 30|  Homicide|
|  2|  Harvey Bullock| 35|  Homicide|
|  3| Kristen Kringle| 28|   Records|
|  4|    Edward Nygma| 30| Forensics|
|  5|Leslie Thompkins| 31| Forensics|
+---+----------------+---+----------+

scala> empDF.groupBy("department").count //count returned a DataFrame
res1: org.apache.spark.sql.DataFrame = [department: string, count: bigint]

scala> res1.show
+----------+-----+                                                              
|department|count|
+----------+-----+
|  Homicide|    2|
|   Records|    1|
| Forensics|    2|
+----------+-----+

When I called count on GroupedData (empDF.groupBy("department")), I got another DataFrame as the result (res1). This leads me to believe that count in this case was a transformation. It is further supported by the fact that no computations were triggered when I called count, instead, they started when I ran res1.show.

I haven't been able to find any documentation that suggests count could be a transformation as well. Could someone please shed some light on this?

like image 530
Amber Avatar asked Oct 24 '18 10:10

Amber


1 Answers

Case 1:

You use rdd.count() to count the number of rows. Since it initiates the DAG execution and returns the data to the driver, its an action for RDD.

for ex: rdd.count // it returns a Long value

Case 2:

If you call count on Dataframe, it initiates the DAG execution and returns the data to the driver, its an action for Dataframe.

for ex: df.count // it returns a Long value

Case 3:

In your case you are calling groupBy on dataframe which returns RelationalGroupedDataset object, and you are calling count on grouped Dataset which returns a Dataframe, so its a transformation since it doesn't gets the data to the driver and initiates the DAG execution.

for ex:

 df.groupBy("department") // returns RelationalGroupedDataset
          .count // returns a Dataframe so a transformation
          .count // returns a Long value since called on DF so an action
like image 103
bob Avatar answered Sep 21 '22 12:09

bob