I know that count
called on an RDD or a DataFrame is an action. But while fiddling with the spark shell, I observed the following
scala> val empDF = Seq((1,"James Gordon", 30, "Homicide"),(2,"Harvey Bullock", 35, "Homicide"),(3,"Kristen Kringle", 28, "Records"),(4,"Edward Nygma", 30, "Forensics"),(5,"Leslie Thompkins", 31, "Forensics")).toDF("id", "name", "age", "department")
empDF: org.apache.spark.sql.DataFrame = [id: int, name: string, age: int, department: string]
scala> empDF.show
+---+----------------+---+----------+
| id| name|age|department|
+---+----------------+---+----------+
| 1| James Gordon| 30| Homicide|
| 2| Harvey Bullock| 35| Homicide|
| 3| Kristen Kringle| 28| Records|
| 4| Edward Nygma| 30| Forensics|
| 5|Leslie Thompkins| 31| Forensics|
+---+----------------+---+----------+
scala> empDF.groupBy("department").count //count returned a DataFrame
res1: org.apache.spark.sql.DataFrame = [department: string, count: bigint]
scala> res1.show
+----------+-----+
|department|count|
+----------+-----+
| Homicide| 2|
| Records| 1|
| Forensics| 2|
+----------+-----+
When I called count
on GroupedData (empDF.groupBy("department")
), I got another DataFrame as the result (res1). This leads me to believe that count
in this case was a transformation. It is further supported by the fact that no computations were triggered when I called count
, instead, they started when I ran res1.show
.
I haven't been able to find any documentation that suggests count
could be a transformation as well. Could someone please shed some light on this?
Case 1:
You use rdd.count()
to count the number of rows. Since it initiates the DAG execution and returns the data to the driver, its an action for RDD.
for ex: rdd.count // it returns a Long value
Case 2:
If you call count on Dataframe, it initiates the DAG execution and returns the data to the driver, its an action for Dataframe.
for ex: df.count // it returns a Long value
Case 3:
In your case you are calling groupBy
on dataframe
which returns RelationalGroupedDataset
object, and you are calling count
on grouped Dataset which returns a Dataframe
, so its a transformation since it doesn't gets the data to the driver and initiates the DAG execution.
for ex:
df.groupBy("department") // returns RelationalGroupedDataset
.count // returns a Dataframe so a transformation
.count // returns a Long value since called on DF so an action
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With