
Difference between df.SaveAsTable and spark.sql(Create table..)

Referring to here on the difference between saveAsTable and insertInto:

What is the difference between the following two approaches:

df.saveAsTable("mytable");

and

df.createOrReplaceTempView("my_temp_table");
spark.sql("drop table if exists " + my_temp_table);
spark.sql("create table mytable as select * from 
my_temp_table");

In which case is the table stored in memory, and in which case physically on disk?

Also, as per my understanding, createOrReplaceTempView only registers the dataframe (already in memory) to be accessible through Hive queries, without actually persisting it. Is that correct?

I have to join hundreds of tables and hit an OutOfMemory issue. In terms of efficiency, what would be the best way?

  • df.persist() and df.join(..).join(..).join(..).... #hundred joins

  • createOrReplaceTempView then join with spark.sql(),

  • SaveAsTable (? not sure the next step)

  • Write to disk with Create Table then join with spark.sql()?

Asked Apr 15 '19 by Kenny

People also ask

What is the difference between DataFrame and Spark SQL?

A Spark DataFrame is basically a distributed collection of rows (Row types) with the same schema; it is a Spark Dataset organized into named columns. A point to note here is that Datasets are an extension of the DataFrame API that provides a type-safe, object-oriented programming interface.

What is Spark saveAsTable?

saveAsTable("t") . When the table is dropped, the custom table path will not be removed and the table data is still there. If no custom table path is specified, Spark will write data to a default table path under the warehouse directory. When the table is dropped, the default table path will be removed too.

Does Spark SQL create a DataFrame?

With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.

What is the difference between saveAsTable and insert into?

saveAsTable uses column-name-based resolution while insertInto uses position-based resolution. In Append mode, saveAsTable pays more attention to the underlying schema of the existing table to make certain resolutions.


1 Answer

Let's go step-by-step.

In the case of df.saveAsTable("mytable"), the table is actually written to storage (HDFS/S3). It is a Spark action.

On the other hand: df.createOrReplaceTempView("my_temp_table") is a transformation. It is just an identifier to be used for the DAG of df. Nothing is actually stored in memory or on disk.
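
For illustration, here is a minimal sketch of the two calls, assuming Spark 2.x (where saveAsTable is reached through df.write) and a Hive-enabled session; the DataFrame itself is a stand-in:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("saveAsTable-vs-tempView")
  .enableHiveSupport()                      // needed for persistent metastore tables
  .getOrCreate()

val df = spark.range(0, 1000).toDF("id")    // stand-in for any DataFrame

// Eager: computes df and writes its files under the warehouse directory,
// registering "mytable" in the metastore.
df.write.mode("overwrite").saveAsTable("mytable")

// Lazy: only registers a name for df's DAG; nothing is computed or persisted yet.
df.createOrReplaceTempView("my_temp_table")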

spark.sql("drop table if exists " + my_temp_table) drops the table.

spark.sql("create table mytable as select * from my_temp_table") creates mytable on storage. createOrReplaceTempView creates tables in global_temp database.

It would be best to modify the query to:

create table mytable as select * from global_temp.my_temp_table
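
A short sketch of both scopes, using the spark and df from above (the view and table names are illustrative):

// Session-scoped view: referenced directly by name within the same session.
df.createOrReplaceTempView("my_temp_table")
spark.sql("create table mytable as select * from my_temp_table")

// Global view: lives in the global_temp database and must be qualified.
df.createOrReplaceGlobalTempView("my_global_view")
spark.sql("create table mytable_from_global as select * from global_temp.my_global_view")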

createOrReplaceTempView only registers the dataframe (already in memory) to be accessible through Hive queries, without actually persisting it, is it correct?

Yes. For large DAGs, Spark will automatically cache data depending on the spark.memory.fraction setting; see the Spark memory management documentation for details.
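
If you do want to tune that, note that these are launch-time settings; a sketch, assuming Spark 2.x defaults:

import org.apache.spark.sql.SparkSession

// Pass these on the builder or via --conf with spark-submit;
// they are not meant to be changed at runtime with spark.conf.set.
val spark = SparkSession.builder()
  .appName("memory-tuning")
  .config("spark.memory.fraction", "0.6")         // heap share for execution + storage (default 0.6)
  .config("spark.memory.storageFraction", "0.5")  // part of that protected from eviction (default 0.5)
  .getOrCreate()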

I have to join hundreds of tables and hit an OutOfMemory issue. In terms of efficiency, what would be the best way?

df.persist() and df.join(..).join(..).join(..).... #hundred joins

createOrReplaceTempView then join with spark.sql(),

SaveAsTable (? not sure the next step)

Write to disk with Create Table then join with spark.sql()?

persist would store some data in cached format, depending on available memory, but for an end table that is generated by joining hundreds of tables this is probably not the best approach.
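
For completeness, a minimal sketch of an explicit persist with a spill-to-disk storage level (df and the count trigger are illustrative):

import org.apache.spark.storage.StorageLevel

// df is any DataFrame; MEMORY_AND_DISK spills partitions that don't fit in
// memory to local disk instead of failing or recomputing them.
val cached = df.persist(StorageLevel.MEMORY_AND_DISK)
cached.count()   // persist is lazy; an action is needed before the cache is populated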

It is hard to suggest the one approach that will work for you, but here are some general patterns:

If writes fail with OOM and the default spark.sql.shuffle.partitions value is used, then the starting point is to increase the shuffle partition count so that each partition is sized appropriately for the memory available to its executor.

The spark.sql.shuffle.partitions setting can be changed between joins; it doesn't need to be constant across the Spark job.
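
As a sketch (the DataFrames a, b and the join column "key" are assumptions for illustration), the value in effect when an action actually runs is the one that is used:

spark.conf.set("spark.sql.shuffle.partitions", "2000")   // more, smaller shuffle partitions
val ab = a.join(b, Seq("key"))
ab.count()   // this shuffle runs with 2000 partitions (the value at action time)

spark.conf.set("spark.sql.shuffle.partitions", "400")    // later, lighter queries
// only queries executed after this point pick up the new value

Because evaluation is lazy, joins that run inside the same action share a single value; giving different joins different values means materializing in between, which is what the next point is about.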

Calculating partition sizes becomes difficult when multiple tables are involved. In that case, writing intermediate results to disk and reading them back before joining the large tables is a good idea.
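
A sketch of that pattern (a, b, c, d, "key" and the Parquet path are illustrative); reading the intermediate result back gives the next join a short lineage and a known on-disk size:

val partial = a.join(b, Seq("key")).join(c, Seq("key"))
partial.write.mode("overwrite").parquet("/tmp/partial_join")

val partialFromDisk = spark.read.parquet("/tmp/partial_join")   // fresh plan, short lineage
val result = partialFromDisk.join(d, Seq("key"))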

For small tables (less than 2 GB), broadcasting is a possibility. The default limit is 10 MB, but it can be changed.
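
For example (bigDf, smallDf and the 200 MB threshold are illustrative): the limit is spark.sql.autoBroadcastJoinThreshold, and a broadcast can also be forced per join with a hint:

import org.apache.spark.sql.functions.broadcast

// Raise the automatic broadcast threshold from the 10 MB default (value is in bytes).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 200L * 1024 * 1024)

// Or force a broadcast for one join regardless of the threshold.
val joined = bigDf.join(broadcast(smallDf), Seq("key"))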

It would be best if the final table were stored on disk rather than serving Thrift clients through temp tables.

Good luck!

Answered Oct 03 '22 by Sai