
Connect Amazon EMR Spark with MySQL (writing data)

I have a potentially stupid question; I actually fixed this issue when running Spark locally but haven't been able to resolve it when running it on AWS EMR.

Basically, I have a pyspark script which I submit that reads in data, manipulates it, processes it into a Spark Dataframe and writes it into a MySQL table that I have already hosted elsewhere on AWS RDS.

This is EMR 5.6, with Spark 2.1.1

I downloaded the latest MySQL Connector/J driver ("mysql-connector-java-5.1.42-bin.jar") and copied it to the master node instance (basically downloaded it onto my local laptop and then used scp to put it on the master node).

I then found my spark-defaults.conf file under /etc/spark/conf and edited the following parameters:

spark.driver.extraClassPath
spark.executor.extraClassPath

To both of these, I added the path to my mysql-connector file, which was found at /home/hadoop/mysql-connector-java-5.1.42-bin.jar
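For reference, the edited entries in /etc/spark/conf/spark-defaults.conf ended up looking roughly like this (the connector jar is prepended to the classpath entries EMR already ships; the trailing entries are abbreviated here):

spark.driver.extraClassPath      /home/hadoop/mysql-connector-java-5.1.42-bin.jar:/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:...
spark.executor.extraClassPath    /home/hadoop/mysql-connector-java-5.1.42-bin.jar:/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:...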

Based on this SO post (Adding JDBC driver to Spark on EMR), I used the following command to submit (including the entire classpath from "extraClassPath"):

spark-submit sample_script.py --driver-class-path /home/hadoop/mysql-connector-java-5.1.42-bin.jar:/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*

In my code, I have a spark dataframe and the following code is what writes to the database:

SQL_CONN = "jdbc:mysql://name.address.amazonaws.com:8000/dbname?user=user&password=pwd"
spark_df.write.jdbc(SQL_CONN, table="tablename", mode="append", properties={"driver":'com.mysql.jdbc.Driver'})
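For completeness, here is a minimal, self-contained sketch of the kind of script I'm submitting (the DataFrame contents are dummy placeholder data, and the connection string uses the same placeholder endpoint and credentials as above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql_write").getOrCreate()

# Placeholder endpoint/credentials -- substitute the real RDS values
SQL_CONN = "jdbc:mysql://name.address.amazonaws.com:8000/dbname?user=user&password=pwd"

# Dummy data standing in for the real processing
spark_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

spark_df.write.jdbc(SQL_CONN, table="tablename", mode="append",
                    properties={"driver": "com.mysql.jdbc.Driver"})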

The specific error I get is this:

java.lang.ClassNotFoundException (com.mysql.jdbc.Driver) [duplicate 51]

Any input would be appreciated... this feels like a really stupid mistake on my part that I am unable to pinpoint.

asked Mar 08 '23 by shishy


2 Answers

Fixed - I was stupid and forgot to put the jar file on my slave nodes as well. I forgot that --driver-class-path doesn't automatically distribute the jar to the slaves.

It worked once I put the jar file in the same root directory as it was in my master node (i.e. /home/hadoop in my case).
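For anyone doing this by hand, something like the following (run from the master node, assuming your SSH key is available there) copies the jar to each core node. The hostnames are placeholders; you can get the real private DNS names from yarn node -list, and the default EMR login user is hadoop:

for host in ip-10-0-0-1 ip-10-0-0-2; do
    scp /home/hadoop/mysql-connector-java-5.1.42-bin.jar hadoop@"$host":/home/hadoop/
done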

Hope this helps.

answered Mar 21 '23 by shishy


Although the author's answer is correct, instead of copying the jar around manually you can pass it with --jars when submitting, and Spark will handle the rest (distributing it to the executors) for you:

spark-submit --jars /home/hadoop/mysql-connector-java-5.1.42-bin.jar sample_script.py
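One thing to watch out for: spark-submit options such as --jars or --driver-class-path must come before the application script; anything after the script name is passed to the script itself as an argument. That ordering is likely why the --driver-class-path flag in the question's command had no effect.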

Although not asked explicitly: in an EMR notebook, where you don't run spark-submit yourself, there is an even easier way.

Upload the jar file to S3, then make this the first cell of the notebook:

%%configure -f
{
    "conf": {
        "spark.jars": "s3://jar-test/mysql-connector-java-5.1.42-bin.jar"        
    }
}
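The -f flag forces the underlying Livy session to be (re)created with this configuration, which is why it has to run first, before any Spark code in the notebook.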
answered Mar 21 '23 by A.B