Spark DataFrames: registerTempTable vs not

I just started with DataFrames yesterday and am really liking them so far.

I don't understand one thing, though (referring to the example under "Programmatically Specifying the Schema" here: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema).

In this example the DataFrame is registered as a table (I am guessing to provide access via SQL queries?), but the exact same information can also be retrieved with peopleDataFrame.select("name").

So my question is: when would you want to register a DataFrame as a table instead of just using the given DataFrame functions? And is one option more efficient than the other?

asked Jun 18 '15 by user3376961

People also ask

What is registerTempTable in Spark?

registerTempTable(name) registers this DataFrame as a temporary table using the given name. The lifetime of this temporary table is tied to the SparkSession that was used to create the DataFrame. New in version 1.3.0.
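A minimal sketch of using it (assuming an existing sqlContext and a DataFrame df with name and age columns):

df.registerTempTable("people")
// In Spark 2.0+ the equivalent call is df.createOrReplaceTempView("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.show()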

Is Spark SQL faster than DataFrame?

Test results: RDDs outperformed DataFrames and SparkSQL for certain types of data processing. DataFrames and SparkSQL performed about the same, although with analysis involving aggregation and sorting SparkSQL had a slight advantage.
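One way to check this yourself: the SQL and DataFrame forms compile through the same optimizer, so their physical plans can be compared with explain(). A sketch (assuming a registered table cust with custId and purchaseAmount columns):

val viaSql = sqlContext.sql("SELECT custId, sum(purchaseAmount) FROM cust GROUP BY custId")
val viaApi = sqlContext.table("cust").groupBy("custId").sum("purchaseAmount")
viaSql.explain()  // the two physical plans should be essentially identical
viaApi.explain()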

Is Spark DataFrame faster than pandas?

In very simple words, Pandas runs operations on a single machine, whereas PySpark runs on multiple machines. If you are working on a Machine Learning application dealing with larger datasets, PySpark is a better fit, as it can process operations many times (up to 100x) faster than Pandas.

Is Spark DataFrame lazy?

Spark is written in Scala, which supports lazy evaluation, and execution in Spark is lazy by default. This means that operations over an RDD/DataFrame/Dataset are never computed until an action is called.
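A small sketch of that behaviour (assuming an existing SparkSession named spark):

import org.apache.spark.sql.functions.col
val big = spark.range(1000000000L)           // transformation: builds a plan, computes nothing
val evens = big.filter(col("id") % 2 === 0)  // still lazy: only extends the plan
val n = evens.count()                        // action: the computation actually runs here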


2 Answers

The reason to use the registerTempTable(tableName) method for a DataFrame is that, in addition to being able to use the Spark-provided methods of a DataFrame, you can also issue SQL queries via the sqlContext.sql(sqlQuery) method that use that DataFrame as an SQL table. The tableName parameter specifies the table name to use for that DataFrame in the SQL queries.

val sc: SparkContext = ...
val hc = new HiveContext(sc)
val customerDataFrame = myCodeToCreateOrLoadDataFrame()
customerDataFrame.registerTempTable("cust")
val query = """SELECT custId, sum(purchaseAmount) FROM cust GROUP BY custId"""
val salesPerCustomer: DataFrame = hc.sql(query)
salesPerCustomer.show()

Whether to use SQL or DataFrame methods like select and groupBy is largely a matter of preference. Both forms are translated into the same Spark execution plans, so performance should generally be equivalent.
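For example, the SQL query above has a direct DataFrame-method equivalent (a sketch against the same customerDataFrame, with the same assumed column names):

import org.apache.spark.sql.functions.sum
val salesPerCustomer2 = customerDataFrame
  .groupBy("custId")
  .agg(sum("purchaseAmount"))
salesPerCustomer2.show()

Calling explain() on both results is a quick way to confirm they produce essentially the same plan.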

In my case, I found that certain kinds of aggregation and windowing queries that I needed, such as computing a running balance per customer, were available in the Hive SQL query language but would, I suspect, have been very difficult to express with the DataFrame methods.
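As an illustration, a running balance per customer can be written with a Hive window function. A sketch (the transactions table and its custId, txnDate, and amount columns are hypothetical):

val runningBalance = hc.sql("""
  SELECT custId, txnDate, amount,
         sum(amount) OVER (PARTITION BY custId ORDER BY txnDate) AS runningBalance
  FROM transactions
""")
runningBalance.show()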

If you want to use SQL, then you will most likely want to create a HiveContext instead of a regular SQLContext, as the Hive query language supports a broader range of SQL than is available via a plain SQLContext.
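As a side note, in Spark 2.x the two contexts were unified into SparkSession, and Hive support is enabled on the builder. A minimal sketch:

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("sql-example")      // hypothetical app name
  .enableHiveSupport()         // plays the role HiveContext did in Spark 1.x
  .getOrCreate()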

answered Sep 28 '22 by rake


It's convenient to load the DataFrame into a temp view in a notebook, for example, where you can run exploratory queries on the data:

df.createOrReplaceTempView("myTempView") 

Then in another notebook you can run a SQL query and get all the nice integration features that come out of the box, e.g. table and graph visualisation:

%sql
SELECT * FROM myTempView
answered Sep 28 '22 by Todor Kolev


