In Spark SQL (working with the Java APIs) I have a DataFrame. The DataFrame has a select method. I wonder: is it a transformation or an action? I just need a confirmation and a good reference that states it clearly.
Transformations are the ones that produce new Datasets; actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy).
When we look at the Spark API, we can easily spot the difference between transformations and actions. If a function returns a DataFrame, Dataset, or RDD, it is a transformation. If it returns anything else, or does not return a value at all (returns Unit in the Scala API), it is an action.
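As a rough analogy outside Spark, plain Scala collection views follow the same pattern: a view's map returns another view and defers the work (a "transformation"), while sum returns a plain Int immediately (an "action"). A minimal sketch, no Spark required (the object name and counter are mine, purely for illustration):

```scala
// Plain-Scala analogy for the return-type rule of thumb (not Spark itself):
// map on a view returns another view (deferred, like a transformation),
// while sum returns an Int (computed now, like an action).
object ViewAnalogy extends App {
  var applied = 0 // side effect lets us observe when the function actually runs

  val view = Vector(1, 2, 3).view.map { x =>
    applied += 1
    x * 2
  }
  println(s"after map: applied = $applied") // still 0: nothing has run yet

  val total = view.sum // forces evaluation and returns a plain Int
  println(s"sum = $total, applied = $applied") // 12, 3
}
```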
The select() method is useful when you simply need to select a subset of columns from a Spark DataFrame. On the other hand, selectExpr() comes in handy when you need to select particular columns while also applying some transformation over them.
You can select one or more columns of a Spark DataFrame by passing the column names you want to the select() function. Since a DataFrame is immutable, this creates a new DataFrame with the selected columns. The show() function is used to display the DataFrame's contents.
It is a transformation. Please refer to: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/Dataset.html
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
Operations available on Datasets are divided into transformations and actions. Transformations are the ones that produce new Datasets; actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions include count, show, or writing data out to file systems.
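To make the Javadoc's division concrete, here is a toy sketch (my own simplification, not Spark's actual implementation) of the design it describes: transformations return a new dataset wrapping a deferred plan, while actions force that plan and return a plain value.

```scala
// Toy sketch of Spark's lazy-plan design (NOT Spark's real code):
// a ToyDataset just carries a thunk describing how to produce its rows.
final case class ToyDataset[A](plan: () => Seq[A]) {
  // Transformations: return a new ToyDataset; nothing is computed yet.
  def map[B](f: A => B): ToyDataset[B] = ToyDataset(() => plan().map(f))
  def filter(p: A => Boolean): ToyDataset[A] = ToyDataset(() => plan().filter(p))

  // Actions: trigger computation and return an ordinary value.
  def count(): Long = plan().length.toLong
  def collect(): Seq[A] = plan()
}

object ToyDatasetDemo extends App {
  var evaluations = 0
  val ds = ToyDataset(() => Seq(1, 2, 3, 4))
    .map { x => evaluations += 1; x * 10 } // transformation: not run yet
  println(s"after map: $evaluations evaluations") // 0
  println(s"count = ${ds.count()}")               // action forces the plan: 4
  println(s"after count: $evaluations evaluations") // 4
}
```

The design choice this mirrors is that a transformation only extends the description of the computation; only an action walks the description and does the work.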
If you execute the code below, you will see the output in the console:
import org.apache.spark.sql.SparkSession

object learnSpark2 extends App {
  val sparksession = SparkSession.builder()
    .appName("Learn Spark")
    .config("spark.master", "local")
    .getOrCreate()

  val range = sparksession.range(1, 500).toDF("numbers")
  range.select(range.col("numbers"), range.col("numbers") + 10).show(2)
}
+-------+--------------+
|numbers|(numbers + 10)|
+-------+--------------+
|      1|            11|
|      2|            12|
+-------+--------------+
only showing top 2 rows
If you execute the following code with only select and no show, you will not see any output even though the code executes. This means select is just a transformation, not an action, so it is not evaluated.
import org.apache.spark.sql.SparkSession

object learnSpark2 extends App {
  val sparksession = SparkSession.builder()
    .appName("Learn Spark")
    .config("spark.master", "local")
    .getOrCreate()

  val range = sparksession.range(1, 500).toDF("numbers")
  range.select(range.col("numbers"), range.col("numbers") + 10)
}
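The same "no output until forced" effect can be reproduced without Spark using a lazy Scala view: mapping with a println prints nothing until something forces the view, just as select alone only schedules work that show, count, etc. will trigger. A small sketch, purely as an analogy for Spark's laziness (object and value names are mine):

```scala
object LazyNoOutput extends App {
  // Building the pipeline prints nothing: map on a view is deferred,
  // like select on a DataFrame.
  val pipeline = (1 to 3).view.map { x =>
    println(s"evaluating $x") // runs only when the view is forced
    x + 10
  }
  println("pipeline built, nothing evaluated yet")

  // Forcing the view (analogous to calling show) finally runs the function.
  val result = pipeline.toList
  println(s"result = $result")
}
```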
In the console:
19/01/03 22:46:25 INFO Utils: Successfully started service 'sparkDriver' on port 55531.
19/01/03 22:46:25 INFO SparkEnv: Registering MapOutputTracker
19/01/03 22:46:25 INFO SparkEnv: Registering BlockManagerMaster
19/01/03 22:46:25 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
19/01/03 22:46:25 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
19/01/03 22:46:25 INFO DiskBlockManager: Created local directory at C:\Users\swilliam\AppData\Local\Temp\blockmgr-9abc8a2c-15ee-4e4f-be04-9ef37ace1b7c
19/01/03 22:46:25 INFO MemoryStore: MemoryStore started with capacity 1992.9 MB
19/01/03 22:46:25 INFO SparkEnv: Registering OutputCommitCoordinator
19/01/03 22:46:25 INFO Utils: Successfully started service 'SparkUI' on port 4040.
19/01/03 22:46:26 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.192.99.214:4040
19/01/03 22:46:26 INFO Executor: Starting executor ID driver on host localhost
19/01/03 22:46:26 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 55540.
19/01/03 22:46:26 INFO NettyBlockTransferService: Server created on 10.192.99.214:55540
19/01/03 22:46:26 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/01/03 22:46:26 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.192.99.214, 55540, None)
19/01/03 22:46:26 INFO BlockManagerMasterEndpoint: Registering block manager 10.192.99.214:55540 with 1992.9 MB RAM, BlockManagerId(driver, 10.192.99.214, 55540, None)
19/01/03 22:46:26 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.192.99.214, 55540, None)
19/01/03 22:46:26 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.192.99.214, 55540, None)
19/01/03 22:46:26 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/C:/UDEMY/SparkJob/spark-warehouse/').
19/01/03 22:46:26 INFO SharedState: Warehouse path is 'file:/C:/UDEMY/SparkJob/spark-warehouse/'.
19/01/03 22:46:27 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
19/01/03 22:46:29 INFO SparkContext: Invoking stop() from shutdown hook
19/01/03 22:46:29 INFO SparkUI: Stopped Spark web UI at http://10.192.99.214:4040
19/01/03 22:46:29 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/01/03 22:46:29 INFO MemoryStore: MemoryStore cleared
19/01/03 22:46:29 INFO BlockManager: BlockManager stopped
19/01/03 22:46:29 INFO BlockManagerMaster: BlockManagerMaster stopped
19/01/03 22:46:29 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/01/03 22:46:29 INFO SparkContext: Successfully stopped SparkContext
19/01/03 22:46:29 INFO ShutdownHookManager: Shutdown hook called
19/01/03 22:46:29 INFO ShutdownHookManager: Deleting directory C:\Users\swilliam\AppData\Local\Temp\spark-c69bfb9b-f351-45af-9947-77950b23dd15
Picked up JAVA_TOOL_OPTIONS: -Djavax.net.ssl.trustStore="C:\Program Files\SquirrelSQL\certificates\jssecacerts"