How to improve performance for slow Spark jobs using DataFrame and JDBC connection?

I am trying to access a mid-size Teradata table (~100 million rows) via JDBC in standalone mode on a single node (local[*]).

I am using Spark 1.4.1, set up on a very powerful machine (2 CPUs, 24 cores, 126 GB RAM).

I have tried several memory settings and tuning options to make it run faster, but none of them made a big difference.

I am sure there is something I am missing. Below is my final attempt, which took about 11 minutes to get this simple count, whereas it took only 40 seconds to get the same count using a JDBC connection through R.

bin/pyspark --driver-memory 40g --executor-memory 40g

df = sqlContext.read.jdbc("jdbc:teradata://......")
df.count()

When I tried the same thing with a big table (5 billion records), no results were returned when the query completed.

asked Aug 24 '15 17:08

Dev Patel

1 Answer

All aggregation operations are performed only after the whole dataset has been retrieved into memory as a DataFrame, so doing the count in Spark will never be as efficient as doing it directly in Teradata. Sometimes it is worth pushing some of the computation into the database by creating views and then mapping those views through the JDBC API.
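As a rough illustration of pushing the count into the database without creating a view: the `table` argument of `read.jdbc` accepts anything that is valid in a FROM clause, so a parenthesised subquery with an alias can stand in for a table name. The sketch below is in Python to match the question. The helper names, the alias `t`, and `MYDB.MYTABLE` are made up for illustration, and the alias syntax may need adjusting for your database.

```python
def count_subquery(table, alias="t"):
    """Build a derived-table expression so the database does the counting."""
    return "(SELECT COUNT(*) AS cnt FROM {0}) {1}".format(table, alias)


def remote_count(sql_context, url, table, properties=None):
    """Fetch the single-row result of the pushed-down COUNT(*) over JDBC.

    sql_context is assumed to be an existing SQLContext (Spark 1.4+).
    """
    df = sql_context.read.jdbc(url, count_subquery(table),
                               properties=properties or {})
    # The result has one row and one column (cnt).
    return df.first()[0]


# MYDB.MYTABLE is a placeholder table name.
print(count_subquery("MYDB.MYTABLE"))
```

With this approach only one row travels over the JDBC connection instead of the whole table.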

Every time you use the JDBC driver to access a large table, you should specify a partitioning strategy; otherwise you will create a DataFrame/RDD with a single partition and overload the single JDBC connection.

Instead, you want to try the following API (available since Spark 1.4.0):

sqlctx.read.jdbc(
  url = "<URL>",
  table = "<TABLE>",
  columnName = "<INTEGRAL_COLUMN_TO_PARTITION>", 
  lowerBound = minValue,
  upperBound = maxValue,
  numPartitions = 20,
  connectionProperties = new java.util.Properties()
)
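To see why the partition column should be roughly uniformly distributed, it may help to look at how the bounds are turned into one WHERE clause per partition. The Python sketch below only approximates that splitting logic (it is not Spark's own code, the function name is made up, and details such as NULL handling differ between versions). It assumes non-negative integer bounds.

```python
def partition_where_clauses(column, lower_bound, upper_bound, num_partitions):
    """Approximate the per-partition WHERE clauses of a partitioned JDBC read.

    Returns one clause per partition; None means "no filter".
    """
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    clauses = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower limit, the last has no upper limit.
        lower = "{0} >= {1}".format(column, current) if i != 0 else None
        current += stride
        upper = ("{0} < {1}".format(column, current)
                 if i != num_partitions - 1 else None)
        if lower and upper:
            clauses.append(lower + " AND " + upper)
        else:
            clauses.append(lower or upper)
    return clauses


for clause in partition_where_clauses("ID", 0, 100, 4):
    print(clause)
# ID < 25
# ID >= 25 AND ID < 50
# ID >= 50 AND ID < 75
# ID >= 75
```

Because the first and last clauses are open-ended, `lowerBound` and `upperBound` do not filter any rows; they only decide where the cuts fall. If most values sit in one stride, that partition ends up doing most of the work.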

There is also an option to push down some filtering.

If you don't have a uniformly distributed integral column, you can create custom partitions by specifying custom predicates (WHERE clauses). For example, suppose you have a timestamp column and want to partition by date ranges:

val predicates =
  Array(
    "2015-06-20" -> "2015-06-30",
    "2015-07-01" -> "2015-07-10",
    "2015-07-11" -> "2015-07-20",
    "2015-07-21" -> "2015-07-31"
  ).map {
    case (start, end) =>
      s"cast(DAT_TME as date) >= date '$start'  AND cast(DAT_TME as date) <= date '$end'"
  }

predicates.foreach(println)

// Below are the predicates that were generated:
// cast(DAT_TME as date) >= date '2015-06-20'  AND cast(DAT_TME as date) <= date '2015-06-30'
// cast(DAT_TME as date) >= date '2015-07-01'  AND cast(DAT_TME as date) <= date '2015-07-10'
// cast(DAT_TME as date) >= date '2015-07-11'  AND cast(DAT_TME as date) <= date '2015-07-20'
// cast(DAT_TME as date) >= date '2015-07-21'  AND cast(DAT_TME as date) <= date '2015-07-31'



sqlctx.read.jdbc(
  url = "<URL>",
  table = "<TABLE>",
  predicates = predicates,
  connectionProperties = new java.util.Properties()
)

This generates a DataFrame in which each partition contains the records returned by the subquery for the corresponding predicate.
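Since the question uses pyspark, here is a hedged Python sketch of the same idea that builds the date-range predicates programmatically rather than listing them by hand. The function name and the 10-day window are made up for illustration; `DAT_TME` is the example column from above.

```python
from datetime import date, timedelta


def date_range_predicates(column, start, end, days_per_partition):
    """Split [start, end] into consecutive, non-overlapping date ranges
    and return one WHERE-clause predicate per range."""
    predicates = []
    current = start
    while current <= end:
        # Clamp the final range so it never runs past the requested end date.
        last = min(current + timedelta(days=days_per_partition - 1), end)
        predicates.append(
            "cast({0} as date) >= date '{1}' AND cast({0} as date) <= date '{2}'"
            .format(column, current.isoformat(), last.isoformat()))
        current = last + timedelta(days=1)
    return predicates


predicates = date_range_predicates(
    "DAT_TME", date(2015, 7, 1), date(2015, 7, 31), 10)
for p in predicates:
    print(p)

# With an existing SQLContext (Spark 1.4+), the list could then be passed as:
# df = sqlContext.read.jdbc("<URL>", "<TABLE>", predicates=predicates, properties={})
```

The ranges must not overlap and must together cover every row you want: a row matching two predicates is read twice, and a row matching none is silently left out.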

Check the source code at DataFrameReader.scala

answered Sep 22 '22 00:09

Gianmario Spacagna

Tags:

apache-spark

pyspark

spark-dataframe

teradata
