I need to write about 1 million rows from a Spark DataFrame to MySQL, but the insert is too slow. How can I improve it?
Code below:
df = sqlContext.createDataFrame(rdd, schema)
df.write.jdbc(url='xx', table='xx', mode='overwrite')
One thing to check is the level of concurrency. Each DataFrame partition writes to MySQL over its own JDBC connection, so too many concurrent tasks can overwhelm the database, while too few leave the cluster idle. High concurrency is usually a benefit in Spark because of its fine-grained task scheduling, but for JDBC writes the partition count should be tuned to what the database can absorb.
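As a rough sketch (the URL, table name, credentials, and partition count below are placeholders, not from the question), you can cap the write concurrency by coalescing to a modest number of partitions before the write:

# Sketch only: coalescing caps the number of simultaneous JDBC
# connections the write opens against MySQL.
df.coalesce(8).write.jdbc(
    url='jdbc:mysql://host:3306/db?user=user&password=pass',
    table='my_table',
    mode='overwrite')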
To connect to MySQL from a Spark shell, point a JDBC URL at the MySQL server (the host and port must be set) and use the SQLContext's read/load API to access a table. The same URL is used for writes, and Connector/J connection properties can be appended to it.
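For instance (hostname, database, table, and credentials below are hypothetical), reading a table through such a JDBC URL looks like this:

# Sketch: placeholder connection details, adjust for your server.
df_in = sqlContext.read.format('jdbc').options(
    url='jdbc:mysql://localhost:3306/mydb',
    dbtable='my_table',
    user='user',
    password='pass').load()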
The answer at https://stackoverflow.com/a/10617768/3318517 worked for me: add rewriteBatchedStatements=true to the connection URL. (See Configuration Properties for Connector/J.)
My benchmark went from 3325 seconds to 42 seconds!
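Applied to the code in the question, a minimal sketch looks like the following (host, database, table, and credentials are placeholders):

# rewriteBatchedStatements=true tells Connector/J to rewrite batched
# single-row INSERTs into multi-row INSERT statements.
url = 'jdbc:mysql://localhost:3306/mydb?rewriteBatchedStatements=true'
df.write.jdbc(url=url, table='my_table', mode='overwrite',
              properties={'user': 'user', 'password': 'pass'})

Depending on your Spark version, the batchsize JDBC write option can also be tuned to control how many rows are sent per round trip.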