Spark SQL - DataFrame - select - transformation or action?

In Spark SQL (working with the Java APIs) I have a DataFrame.

The DataFrame has a select method. I wonder if it's a transformation or an action?

I just need a confirmation and a good reference which states that clearly.

asked Oct 05 '17 by peter.petrov

People also ask

Is select action or transformation in Spark?

Transformations are the ones that produce new Datasets, and actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy).

What is the difference between transformation and action in Spark?

When we look at the Spark API, we can easily spot the difference between transformations and actions. If a function returns a DataFrame, Dataset, or RDD, it is a transformation. If it returns anything else, or does not return a value at all (Unit in the Scala API), it is an action.
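A minimal Scala sketch of that return-type rule (the names and column here are illustrative, not from the question):

import org.apache.spark.sql.{DataFrame, SparkSession}

object ReturnTypeRule extends App {
    val spark = SparkSession.builder()
        .appName("Return type rule")
        .config("spark.master", "local")
        .getOrCreate()

    val df: DataFrame = spark.range(1, 100).toDF("n")

    val filtered: DataFrame = df.filter("n > 50") // returns a DataFrame -> transformation, lazy
    val total: Long = filtered.count()            // returns a Long      -> action, runs a job
    println(total)
}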

What is the difference between select and selectExpr in Spark?

Therefore, the select() method is useful when you simply need to select a subset of columns from a particular Spark DataFrame. On the other hand, selectExpr() comes in handy when you need to select particular columns while also applying some transformation over those columns.
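For example (a sketch assuming a DataFrame df with a numeric column "numbers"), the two calls below produce the same result:

import org.apache.spark.sql.functions.col

val a = df.select(col("numbers"), (col("numbers") + 10).as("plusTen")) // Column objects
val b = df.selectExpr("numbers", "numbers + 10 AS plusTen")            // SQL expression strings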

How do I select a DataFrame in Spark?

You can select single or multiple columns of a Spark DataFrame by passing the column names you want to select to the select() function. Since DataFrames are immutable, this creates a new DataFrame with the selected columns. The show() function is used to display the DataFrame contents.
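A short sketch, assuming a DataFrame df with columns "name" and "age":

df.select("name")                        // new DataFrame with a single column
val projected = df.select("name", "age") // new DataFrame with multiple columns
projected.show()                         // show() is an action: it prints the rows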


2 Answers

It is a transformation. Please refer to the Dataset Javadoc: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/Dataset.html

A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.

Operations available on Datasets are divided into transformations and actions. Transformations are the ones that produce new Datasets, and actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions include count, show, or writing data out to file systems.
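You can observe the laziness directly: explain() only prints the query plan, while an action such as count() actually runs a job. A sketch, assuming a DataFrame df with a column "numbers":

val projected = df.select(df.col("numbers") + 10)
projected.explain() // prints the physical plan, but no Spark job runs
projected.count()   // an action: only here is the plan actually executed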

answered by Nikhil


If you execute the code below, you will see the output in the console:

import org.apache.spark.sql.SparkSession

object learnSpark2 extends App {
    val sparksession = SparkSession.builder()
        .appName("Learn Spark")
        .config("spark.master", "local")
        .getOrCreate()

    // select is a transformation; show(2) is the action that triggers
    // the job and prints the first two rows
    val range = sparksession.range(1, 500).toDF("numbers")
    range.select(range.col("numbers"), range.col("numbers") + 10).show(2)
}

+-------+--------------+
|numbers|(numbers + 10)|
+-------+--------------+
|      1|            11|
|      2|            12|
+-------+--------------+
If you execute the following code with only select and no show, you will not see any output even though the code runs. This means select is just a transformation, not an action: it is not evaluated until an action triggers it.

import org.apache.spark.sql.SparkSession

object learnSpark2 extends App {
    val sparksession = SparkSession.builder()
        .appName("Learn Spark")
        .config("spark.master", "local")
        .getOrCreate()

    // select alone is a lazy transformation: no action follows,
    // so no Spark job runs and nothing is printed
    val range = sparksession.range(1, 500).toDF("numbers")
    range.select(range.col("numbers"), range.col("numbers") + 10)
}

In the console you see only startup and shutdown logs, and no query output:

19/01/03 22:46:25 INFO Utils: Successfully started service 'sparkDriver' on port 55531.
19/01/03 22:46:25 INFO SparkEnv: Registering MapOutputTracker
19/01/03 22:46:25 INFO SparkEnv: Registering BlockManagerMaster
19/01/03 22:46:25 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
19/01/03 22:46:25 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
19/01/03 22:46:25 INFO DiskBlockManager: Created local directory at C:\Users\swilliam\AppData\Local\Temp\blockmgr-9abc8a2c-15ee-4e4f-be04-9ef37ace1b7c
19/01/03 22:46:25 INFO MemoryStore: MemoryStore started with capacity 1992.9 MB
19/01/03 22:46:25 INFO SparkEnv: Registering OutputCommitCoordinator
19/01/03 22:46:25 INFO Utils: Successfully started service 'SparkUI' on port 4040.
19/01/03 22:46:26 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.192.99.214:4040
19/01/03 22:46:26 INFO Executor: Starting executor ID driver on host localhost
19/01/03 22:46:26 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 55540.
19/01/03 22:46:26 INFO NettyBlockTransferService: Server created on 10.192.99.214:55540
19/01/03 22:46:26 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/01/03 22:46:26 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.192.99.214, 55540, None)
19/01/03 22:46:26 INFO BlockManagerMasterEndpoint: Registering block manager 10.192.99.214:55540 with 1992.9 MB RAM, BlockManagerId(driver, 10.192.99.214, 55540, None)
19/01/03 22:46:26 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.192.99.214, 55540, None)
19/01/03 22:46:26 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.192.99.214, 55540, None)
19/01/03 22:46:26 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/C:/UDEMY/SparkJob/spark-warehouse/').
19/01/03 22:46:26 INFO SharedState: Warehouse path is 'file:/C:/UDEMY/SparkJob/spark-warehouse/'.
19/01/03 22:46:27 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
19/01/03 22:46:29 INFO SparkContext: Invoking stop() from shutdown hook
19/01/03 22:46:29 INFO SparkUI: Stopped Spark web UI at http://10.192.99.214:4040
19/01/03 22:46:29 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/01/03 22:46:29 INFO MemoryStore: MemoryStore cleared
19/01/03 22:46:29 INFO BlockManager: BlockManager stopped
19/01/03 22:46:29 INFO BlockManagerMaster: BlockManagerMaster stopped
19/01/03 22:46:29 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/01/03 22:46:29 INFO SparkContext: Successfully stopped SparkContext
19/01/03 22:46:29 INFO ShutdownHookManager: Shutdown hook called
19/01/03 22:46:29 INFO ShutdownHookManager: Deleting directory C:\Users\swilliam\AppData\Local\Temp\spark-c69bfb9b-f351-45af-9947-77950b23dd15
Picked up JAVA_TOOL_OPTIONS: -Djavax.net.ssl.trustStore="C:\Program Files\SquirrelSQL\certificates\jssecacerts"
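Conversely, appending an action to the same select forces evaluation; a minimal sketch reusing the names from the program above:

// count() is an action, so this line triggers an actual Spark job
println(range.select(range.col("numbers"), range.col("numbers") + 10).count())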
answered by Samuel William