I am trying to work with Apache Spark with MySQL as the data source. I have a cluster with 1 master and 1 slave node, each with 8 GB of RAM and 2 cores. I am submitting my SQL query to Spark using spark-shell, and the table has 6402821 rows. I am performing a group by on that table. MySQL takes 5.2 seconds, but when I run the same query through Spark it takes 21 seconds. Why is this happening?
I am also setting some configurations like partitionColumn, upperBound, lowerBound, and numPartitions, but still no change.
I have also tried executing the query with 1, 2 and 4 cores, but the time taken by Spark is the same: 21 seconds.
Does this problem occur because my MySQL database is on a single machine, so all Spark nodes try to query the data on that single machine?
Can anyone help me solve this issue?
The database has a table called demo_call_stats, which I am querying like this:
val jdbcDF = spark.read.format("jdbc").options(Map(
  "url" -> "jdbc:mysql://192.168.0.31:3306/cmanalytics?user=root&password=",
  "zeroDateTimeBehavior" -> "convertToNull",
  "dbtable" -> "cmanalytics.demo_call_stats",
  "fetchSize" -> "10000",
  "partitionColumn" -> "newpartition",
  "lowerBound" -> "0",
  "upperBound" -> "4",
  "numPartitions" -> "4")).load()
jdbcDF.createOrReplaceTempView("call_stats")
val sqlDF = sql("select Count(*), classification_id from call_stats where campaign_id = 77 group by classification_id")
sqlDF.show()
Any help will be most appreciated.
Thanks
There are a couple of things you should understand here:
Despite what you might have heard, Spark isn't 'faster than MySQL', simply because this kind of generality doesn't mean anything. Spark is faster than MySQL for some queries, and MySQL is faster than Spark for others. Generally speaking, MySQL is a relational database, meaning it has been conceived to serve as a back-end for an application. It is optimized to access records efficiently as long as they are indexed.
When thinking about databases, I like to think of them as a library with one librarian to help you get the books you want (I am speaking about a very old school library here, without any computer to help the librarian).
If you ask your librarian: "I want to know how many books you have that are about Geopolitics", the librarian can go to the Geopolitics shelf and count the number of books on that shelf.
If you ask your librarian: "I want to know how many books you have that have at least 500 pages", the librarian will have to look at every single book in the library to answer your query. In SQL this is called a full table scans. Of course you can have several librarians (processors) working on the query to go faster, but you cannot have more than a few of them (let's say up to 16) inside your library (computer).
Now, Spark has been designed to handle large volumes of data, namely libraries so big that the books won't fit into a single building, and even if they did, there would be so many that even 16 librarians would take days to look at them all to answer your second query.
What makes Spark faster than MySQL is just this: if you put your books in several buildings, you can have 16 librarians per building working on your answer. You can also handle a larger number of books.
Also, since Spark is mostly made to answer the second type of query rather than queries like "Please bring me 'The Picture of Dorian Gray', by Oscar Wilde", it doesn't bother, at least by default, to sort your books in any particular order. This means that if you want to find that one particular book with Spark, your librarians will have to go through the entire library to find it.
Of course, Spark uses many other types of optimizations to perform some queries more efficiently, but indexing is not one of them (if you are familiar with the notion of a primary key in MySQL, there is no such thing in Spark). Other optimizations include storage formats like Parquet and ORC, which allow you to read only the columns that are useful to answer your queries, and compression (e.g. Snappy), which is aimed at increasing the number of books you can fit in your library without having to push the walls.
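Concretely, writing your DataFrame as snappy-compressed Parquet would look something like this (a minimal sketch; the output path is hypothetical):
// Store the table as snappy-compressed Parquet so that later queries
// only read the columns they actually need.
jdbcDF.write.option("compression", "snappy").parquet("/tmp/demo_call_stats_parquet")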
I hope this metaphor helped you, but please bear in mind that this is just a metaphor and doesn't fit reality perfectly.
Now, to get back to the specifics of your question:
Assuming campaign_id is your primary key or you created an index on this column, MySQL will only have to read the rows for which campaign_id = 77.
On the other hand, Spark will have to ask MySQL to send all the rows in that table over to it. If Spark is clever, it will only ask for the rows with campaign_id = 77, and maybe it will send multiple queries to MySQL to fetch the partition ranges in parallel. But this means that all the data MySQL could have read and aggregated locally has to be serialized, sent to Spark, and aggregated by Spark.
I hope you see why this should take longer.
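If you want to make sure MySQL does the filtering before anything crosses the network, you can also push the query down yourself by passing a subquery as the dbtable option (a sketch based on your snippet; the alias t is required by the JDBC source):
// Let MySQL filter on campaign_id = 77 before shipping rows to Spark.
val pushedDF = spark.read.format("jdbc").options(Map(
  "url" -> "jdbc:mysql://192.168.0.31:3306/cmanalytics?user=root&password=",
  "dbtable" -> "(select classification_id from cmanalytics.demo_call_stats where campaign_id = 77) as t"
)).load()
pushedDF.createOrReplaceTempView("call_stats_77")
sql("select count(*), classification_id from call_stats_77 group by classification_id").show()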
If you want Spark to answer your queries faster than MySQL, you should try copying your table into another format, like this:
// replace this line :
// jdbcDF.createOrReplaceTempView("call_stats")
// with :
jdbcDF.write.format("orc").saveAsTable("call_stats")
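After this one-time copy, the same query reads from the ORC-backed table instead of going back to MySQL:
// The table name matches the old view name, so the query itself stays the same.
val sqlDF = sql("select count(*), classification_id from call_stats where campaign_id = 77 group by classification_id")
sqlDF.show()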
Another thing you could try is caching your data like this:
jdbcDF.cache().createOrReplaceTempView("call_stats")
Caching won't bring any improvement for the first query, since Spark caches the data while executing it, but if you keep querying the same view, subsequent queries might be faster. But as I explained above, this doesn't mean Spark will be faster than MySQL for everything.
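If you would rather pay that one-time cost up front, you can force the cache to materialize with any action before you start querying (count() here is just a convenient choice):
// cache() is lazy: run one action so the data is actually read and cached.
jdbcDF.count()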
For small data and local deployments, you can also get a performance improvement by lowering the configuration parameter spark.sql.shuffle.partitions, which is 200 by default.
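You can change it from within the shell (a quick sketch; 4 roughly matches the total number of cores in your cluster):
// 200 shuffle partitions is far too many for a few million rows on 4 cores.
spark.conf.set("spark.sql.shuffle.partitions", "4")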
Hope this helps.