Converting mysql table to spark dataset is very slow compared to same from csv file

I have a CSV file in Amazon S3 that is 62 MB in size (114,000 rows). I am converting it into a Spark Dataset and taking the first 500 rows from it. The code is as follows:

DataFrameReader df = new DataFrameReader(spark).format("csv").option("header", true);
Dataset<Row> set=df.load("s3n://"+this.accessId.replace("\"", "")+":"+this.accessToken.replace("\"", "")+"@"+this.bucketName.replace("\"", "")+"/"+this.filePath.replace("\"", "")+"");

set.take(500);

The whole operation takes 20 to 30 sec.

Now I am trying the same thing, but instead of a CSV file I am using a MySQL table with 119,000 rows. The MySQL server is on Amazon EC2. The code is as follows:

String url ="jdbc:mysql://"+this.hostName+":3306/"+this.dataBaseName+"?user="+this.userName+"&password="+this.password;

SparkSession spark=StartSpark.getSparkSession();

SQLContext sc = spark.sqlContext();

DataFrameReader df = new DataFrameReader(spark).format("csv").option("header", true);
Dataset<Row> set = sc
            .read()
            .option("url", url)
            .option("dbtable", this.tableName)
            .option("driver","com.mysql.jdbc.Driver")
            .format("jdbc")
            .load();
set.take(500);

This takes 5 to 10 minutes. I am running Spark inside the JVM, with the same configuration in both cases.

I could use partitionColumn, numPartitions, etc., but I don't have any numeric column, and the schema of the table is unknown to me.

My question is not how to reduce the time required, since I know that ideally Spark would run on a cluster. What I cannot understand is why there is such a big time difference between the two cases above.

asked Mar 09 '17 13:03

KOUSIK MANDAL

1 Answer

This problem has been covered multiple times on StackOverflow:

How to improve performance for slow Spark jobs using DataFrame and JDBC connection?
spark jdbc df limit... what is it doing?
How to use JDBC source to write and read data in (Py)Spark?

and in external sources:

https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md#parallelizing-reads

To reiterate: by default, DataFrameReader.jdbc doesn't distribute data or reads. It uses a single thread on a single executor.
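Since the whole read goes through one connection, one way to shrink the work when only a few rows are needed is to let MySQL apply the limit itself. Depending on the Spark version, `take(500)` may not be pushed down to the database, but the `dbtable` option accepts a parenthesised subquery with an alias. The sketch below only builds that string; the class name `LimitedTable`, the alias `limited_rows`, and the table name `my_table` are placeholders, not anything from Spark or from the question.

```java
// Illustrative helper, not part of Spark: builds a value for the JDBC
// "dbtable" option so that the database applies the LIMIT.
final class LimitedTable {
    private LimitedTable() {}

    static String limited(String tableName, int maxRows) {
        if (maxRows < 0) {
            throw new IllegalArgumentException("maxRows must be >= 0");
        }
        // The alias is needed because Spark places this string in a FROM clause.
        return "(SELECT * FROM " + tableName + " LIMIT " + maxRows + ") AS limited_rows";
    }

    public static void main(String[] args) {
        // "my_table" is a hypothetical table name.
        System.out.println(limited("my_table", 500));
    }
}
```

It would be used as `.option("dbtable", LimitedTable.limited(this.tableName, 500))` in place of the plain table name. The table name is concatenated as-is, so it should come from a trusted source.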

To distribute reads:

use ranges with lowerBound / upperBound:

// "foo" is a placeholder for a numeric column; the bounds only set the
// partition stride, they do not filter rows
Dataset<Row> set = sc
    .read()
    .option("partitionColumn", "foo")
    .option("numPartitions", "3")
    .option("lowerBound", 0)
    .option("upperBound", 30)
    .option("url", url)
    .option("dbtable", this.tableName)
    .option("driver","com.mysql.jdbc.Driver")
    .format("jdbc")
    .load();
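To make the effect of the bounds concrete, here is a rough sketch of how a range such as 0 to 30 with 3 partitions can be turned into one WHERE clause per partition. It approximates the behaviour described in the Spark documentation (bounds decide the stride; the first and last partitions are open-ended, so no rows are filtered out). `RangePredicates` is a hypothetical name, and the exact arithmetic inside Spark may differ between versions.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: approximates how a (lowerBound, upperBound, numPartitions)
// triple maps to per-partition WHERE clauses. Not Spark's actual code.
final class RangePredicates {
    private RangePredicates() {}

    static List<String> split(String column, long lower, long upper, int numPartitions) {
        if (numPartitions < 2 || lower >= upper) {
            throw new IllegalArgumentException("need numPartitions >= 2 and lower < upper");
        }
        long stride = upper / numPartitions - lower / numPartitions;
        List<String> clauses = new ArrayList<>();
        long current = lower;
        for (int i = 0; i < numPartitions; i++) {
            boolean first = (i == 0);
            boolean last = (i == numPartitions - 1);
            String lowerClause = column + " >= " + current;
            current += stride;
            String upperClause = column + " < " + current;
            if (first) {
                // Open-ended below; NULLs are assigned to the first partition.
                clauses.add(upperClause + " OR " + column + " IS NULL");
            } else if (last) {
                // Open-ended above, so values beyond upperBound are still read.
                clauses.add(lowerClause);
            } else {
                clauses.add(lowerClause + " AND " + upperClause);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        split("foo", 0, 30, 3).forEach(System.out::println);
    }
}
```

One consequence of this scheme: if the real values of the column fall mostly outside the chosen bounds, most rows land in the first or last partition and the read is barely parallel.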

or use predicates:

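// requires: import java.util.Properties;
Properties properties = new Properties();
properties.setProperty("driver", "com.mysql.jdbc.Driver");
// each predicate becomes the WHERE clause of one partition
String[] predicates = {"foo < 10", "foo BETWEEN 10 AND 20", "foo > 20"};
Dataset<Row> set = sc
    .read()
    .jdbc(url, this.tableName, predicates, properties);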
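Predicates do not have to be numeric ranges, which matters when the table has no numeric column. As a sketch under assumptions, the helper below hashes an arbitrary column with MySQL's `CRC32()` and splits rows by `MOD`. This is a swapped-in technique, not something the answer above prescribes; `HashPredicates` and the column name `name` are hypothetical, and the column would first have to be discovered from the table's metadata.

```java
// Illustrative helper, not part of Spark: builds one predicate per bucket
// by hashing a (possibly non-numeric) column on the MySQL side.
final class HashPredicates {
    private HashPredicates() {}

    static String[] forColumn(String column, int buckets) {
        if (buckets < 1) {
            throw new IllegalArgumentException("buckets must be >= 1");
        }
        String[] predicates = new String[buckets];
        for (int i = 0; i < buckets; i++) {
            predicates[i] = "MOD(CRC32(" + column + "), " + buckets + ") = " + i;
        }
        // CRC32(NULL) is NULL in MySQL, so NULL rows would match no bucket;
        // assign them to the first one explicitly.
        predicates[0] = "(" + predicates[0] + " OR " + column + " IS NULL)";
        return predicates;
    }

    public static void main(String[] args) {
        // "name" is a hypothetical column.
        for (String p : forColumn("name", 3)) {
            System.out.println(p);
        }
    }
}
```

The resulting array can be passed as the `predicates` argument of `jdbc(url, table, predicates, properties)`. The trade-off is that MySQL cannot use an index for such an expression, so each partition's query is likely to scan the whole table.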

answered Sep 20 '22 00:09

user7698675

Tags: java, mysql, amazon-s3, jdbc, apache-spark
