I just started with DataFrame yesterday and am really liking it so far.
I don't understand one thing though... (Referring to the example under "Programmatically Specifying the Schema" here: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema)
In this example the DataFrame is registered as a table (I am guessing to provide access to SQL queries?), but the exact same information can also be retrieved with peopleDataFrame.select("name").
So my question is: when would you want to register a DataFrame as a table instead of just using the built-in DataFrame functions? And is one option more efficient than the other?
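For reference, here's a minimal sketch of the two approaches I'm comparing, assuming the peopleDataFrame and sqlContext from the guide's example already exist:

// Option 1: register the DataFrame as a temp table and query it with SQL
peopleDataFrame.registerTempTable("people")
val namesViaSql = sqlContext.sql("SELECT name FROM people")
namesViaSql.show()

// Option 2: call the DataFrame methods directly
val namesViaApi = peopleDataFrame.select("name")
namesViaApi.show()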
registerTempTable(name): Registers this DataFrame as a temporary table using the given name. The lifetime of this temporary table is tied to the SparkSession that was used to create this DataFrame. New in version 1.3.0.
Test results: RDDs outperformed DataFrames and SparkSQL for certain types of data processing. DataFrames and SparkSQL performed about the same, although in analyses involving aggregation and sorting SparkSQL had a slight advantage.
In very simple words, Pandas runs operations on a single machine, whereas PySpark runs on multiple machines. If you are working on a machine learning application with larger datasets, PySpark is the better fit and can process operations many times (100x) faster than Pandas.
We know that Spark is written in Scala, and Scala supports lazy evaluation, but for Spark the execution is lazy by default. This means operations over an RDD/DataFrame/Dataset are never computed until an action is called.
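A small sketch of what that laziness looks like in practice (the DataFrame and column names here are just illustrative):

// filter and select are transformations: nothing is computed yet,
// Spark only builds up an execution plan.
val adults = peopleDataFrame.filter("age >= 18").select("name")

// count is an action: only now does Spark actually run the plan.
val numAdults = adults.count()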
The reason to use the registerTempTable(tableName) method for a DataFrame is that, in addition to being able to use the Spark-provided methods of a DataFrame, you can also issue SQL queries via the sqlContext.sql(sqlQuery) method that use that DataFrame as an SQL table. The tableName parameter specifies the table name to use for that DataFrame in the SQL queries.
val sc: SparkContext = ...
val hc = new HiveContext(sc)
val customerDataFrame = myCodeToCreateOrLoadDataFrame()
customerDataFrame.registerTempTable("cust")
val query = """SELECT custId, sum(purchaseAmount) FROM cust GROUP BY custId"""
val salesPerCustomer: DataFrame = hc.sql(query)
salesPerCustomer.show()
Whether to use SQL or DataFrame methods like select and groupBy is probably largely a matter of preference. My understanding is that the SQL queries get translated into Spark execution plans.
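For example, the grouped query above could just as well be written with the DataFrame methods; here is a sketch reusing the customerDataFrame from the snippet above:

import org.apache.spark.sql.functions.sum

// Equivalent of: SELECT custId, sum(purchaseAmount) FROM cust GROUP BY custId
val salesPerCustomer = customerDataFrame
  .groupBy("custId")
  .agg(sum("purchaseAmount"))
salesPerCustomer.show()

Both forms end up as Spark execution plans, so the choice is mostly about which one reads better to you.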
In my case, I found that certain kinds of aggregation and windowing queries I needed, like computing a running balance per customer, were available in the Hive SQL query language, and I suspect they would have been very difficult to do with Spark's DataFrame methods.
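As a rough illustration of the kind of query I mean (the transactionsDataFrame and its columns are hypothetical, not from my actual code), a running balance per customer can be written with a HiveQL window function:

// transactionsDataFrame and its columns (custId, txnDate, amount) are hypothetical
transactionsDataFrame.registerTempTable("txns")
val runningBalances = hc.sql("""
  SELECT custId, txnDate, amount,
         SUM(amount) OVER (PARTITION BY custId ORDER BY txnDate) AS runningBalance
  FROM txns""")
runningBalances.show()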
If you want to use SQL, then you most likely will want to create a HiveContext instead of a regular SQLContext. The Hive query language supports a broader range of SQL than is available via a plain SQLContext.
It's convenient to load the DataFrame into a temp view in a notebook, for example, so that you can run exploratory queries on the data:
df.createOrReplaceTempView("myTempView")
Then in another notebook you can run a SQL query and get all the nice integration features that come out of the box, e.g. table and graph visualisation:
%sql SELECT * FROM myTempView