I just started with DataFrame yesterday and am really liking it so far.
I don't understand one thing though... (Referring to the example under "Programmatically Specifying the Schema" here: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema)
In this example the DataFrame is registered as a table (I am guessing to provide access to SQL queries?), but the exact same information can also be retrieved with peopleDataFrame.select("name").
So my question is: when would you want to register a DataFrame as a table instead of just using the built-in DataFrame functions? And is one option more efficient than the other?
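For reference, here's a minimal sketch of the two approaches I'm comparing, assuming the peopleDataFrame and sqlContext from the guide's example already exist:

// Option 1: register the DataFrame as a temp table and query it with SQL
peopleDataFrame.registerTempTable("people")
val namesViaSql = sqlContext.sql("SELECT name FROM people")
namesViaSql.show()

// Option 2: call the DataFrame methods directly
val namesViaApi = peopleDataFrame.select("name")
namesViaApi.show()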
registerTempTable(name): Registers this DataFrame as a temporary table using the given name. The lifetime of this temporary table is tied to the SparkSession that was used to create this DataFrame. New in version 1.3.0.
Test results: RDDs outperformed DataFrames and SparkSQL for certain types of data processing. DataFrames and SparkSQL performed about the same, although in analyses involving aggregation and sorting SparkSQL had a slight advantage.
In very simple words, Pandas runs operations on a single machine, whereas PySpark runs on multiple machines. If you are working on a machine learning application with larger datasets, PySpark is the better fit and can process operations many times (100x) faster than Pandas.
We know that Spark is written in Scala, and Scala supports lazy evaluation, but for Spark the execution is lazy by default. This means operations over an RDD/DataFrame/Dataset are never computed until an action is called.
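A small sketch of what that laziness looks like in practice (the DataFrame and column names here are just illustrative):

// filter and select are transformations: nothing is computed yet,
// Spark only builds up an execution plan.
val adults = peopleDataFrame.filter("age >= 18").select("name")

// count is an action: only now does Spark actually run the plan.
val numAdults = adults.count()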
The reason to use the registerTempTable(tableName) method for a DataFrame is that, in addition to being able to use the Spark-provided methods of a DataFrame, you can also issue SQL queries via the sqlContext.sql(sqlQuery) method that use that DataFrame as an SQL table. The tableName parameter specifies the table name to use for that DataFrame in the SQL queries.
val sc: SparkContext = ...
val hc = new HiveContext(sc)
val customerDataFrame = myCodeToCreateOrLoadDataFrame()
customerDataFrame.registerTempTable("cust")
val query = """SELECT custId, sum(purchaseAmount) FROM cust GROUP BY custId"""
val salesPerCustomer: DataFrame = hc.sql(query)
salesPerCustomer.show()
Whether to use SQL or DataFrame methods like select and groupBy is probably largely a matter of preference. My understanding is that the SQL queries get translated into Spark execution plans.
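For example, the grouped query above could just as well be written with the DataFrame methods; here is a sketch reusing the customerDataFrame from the snippet above:

import org.apache.spark.sql.functions.sum

// Equivalent of: SELECT custId, sum(purchaseAmount) FROM cust GROUP BY custId
val salesPerCustomer = customerDataFrame
  .groupBy("custId")
  .agg(sum("purchaseAmount"))
salesPerCustomer.show()

Both forms end up as Spark execution plans, so the choice is mostly about which one reads better to you.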
In my case, I found that certain kinds of aggregation and windowing queries I needed, like computing a running balance per customer, were available in the Hive SQL query language, and I suspect they would have been very difficult to do with Spark's DataFrame methods.
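As a rough illustration of the kind of query I mean (the transactionsDataFrame and its columns are hypothetical, not from my actual code), a running balance per customer can be written with a HiveQL window function:

// transactionsDataFrame and its columns (custId, txnDate, amount) are hypothetical
transactionsDataFrame.registerTempTable("txns")
val runningBalances = hc.sql("""
  SELECT custId, txnDate, amount,
         SUM(amount) OVER (PARTITION BY custId ORDER BY txnDate) AS runningBalance
  FROM txns""")
runningBalances.show()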
If you want to use SQL, then you most likely will want to create a HiveContext instead of a regular SQLContext. The Hive query language supports a broader range of SQL than is available via a plain SQLContext.
It's convenient to load the DataFrame into a temp view in a notebook, for example, so that you can run exploratory queries on the data:
df.createOrReplaceTempView("myTempView")
Then in another notebook you can run a SQL query and get all the nice integration features that come out of the box, e.g. table and graph visualisation:
%sql SELECT * FROM myTempView