MySQL read with PySpark

Tags:

python-3.x

pyspark-sql

I have the following test code:

from pyspark import SparkContext, SQLContext
sc = SparkContext('local')
sqlContext = SQLContext(sc)
print('Created spark context!')


if __name__ == '__main__':
    df = sqlContext.read.format("jdbc").options(
        url="jdbc:mysql://localhost/mysql",
        driver="com.mysql.jdbc.Driver",
        dbtable="users",
        user="user",
        password="****"
    ).load()

    print(df)

When I run it, I get the following error:

java.lang.ClassNotFoundException: com.mysql.jdbc.Driver

In Scala, this is solved by importing the .jar mysql-connector-java into the project.

However, in Python I have no idea how to tell the pyspark module where to find the mysql-connector file.

I have seen this solved with examples like

spark-submit --packages mysql:mysql-connector-java:5.1.39 testfile.py

But I don't want this, since it forces me to run my script in a weird way. I would like an all-Python solution, or to copy a file somewhere, or to add something to the path.

asked Sep 03 '17 12:09

Santi Peñate-Vera

3 Answers

You can pass arguments to spark-submit from Python by setting the PYSPARK_SUBMIT_ARGS environment variable before the SparkContext is created:

import os
from pyspark import SparkConf, SparkContext

SUBMIT_ARGS = "--packages mysql:mysql-connector-java:5.1.39 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
conf = SparkConf()
sc = SparkContext(conf=conf)

Alternatively, you can add them to your $SPARK_HOME/conf/spark-defaults.conf file.
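
For the spark-defaults.conf route, here is a minimal sketch, assuming the spark.jars.packages property is the one you want to set (it takes a comma-separated list of Maven coordinates). The same property can also be set from Python on the session builder. The helper names packages_conf and create_session are hypothetical and only for illustration; the coordinate is the one from the snippet above.

```python
# Line to add to $SPARK_HOME/conf/spark-defaults.conf
# (key and value separated by whitespace):
#   spark.jars.packages  mysql:mysql-connector-java:5.1.39


def packages_conf(coordinates):
    """Return the Spark property that lists Maven coordinates to fetch."""
    return {"spark.jars.packages": ",".join(coordinates)}


def create_session(coordinates, app_name="mysql-read"):
    """Hypothetical helper: build a SparkSession with the packages configured."""
    # Lazy import so packages_conf() can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in packages_conf(coordinates).items():
        builder = builder.config(key, value)
    # The property should only take effect if no SparkContext is running yet.
    return builder.getOrCreate()
```

Calling create_session(["mysql:mysql-connector-java:5.1.39"]) at the top of the script, before any other Spark object is created, should have the same effect as the conf file entry.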

answered Sep 25 '22 19:09

MaFF

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("Word Count")\
    .config("spark.driver.extraClassPath", "/home/tuhin/mysql.jar")\
    .getOrCreate()

dataframe_mysql = spark.read\
    .format("jdbc")\
    .option("url", "jdbc:mysql://localhost/database_name")\
    .option("driver", "com.mysql.jdbc.Driver")\
    .option("dbtable", "employees").option("user", "root")\
    .option("password", "12345678").load()

print(dataframe_mysql.columns)

Here "/home/tuhin/mysql.jar" is the path to the MySQL connector .jar file.
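
If the path is wrong, the only symptom is the same ClassNotFoundException as in the question, so it can help to check the file before starting Spark. A small sketch, assuming the jar is on the local filesystem; classpath_conf is a hypothetical helper, not part of PySpark:

```python
import os


def classpath_conf(jar_path):
    """Return the extraClassPath property for a connector jar, after checking it exists."""
    if not os.path.isfile(jar_path):
        # Fail early with a readable message instead of a JVM ClassNotFoundException.
        raise FileNotFoundError("MySQL connector jar not found: " + jar_path)
    return {"spark.driver.extraClassPath": os.path.abspath(jar_path)}
```

The returned key and value can be passed to .config(key, value) on SparkSession.builder in the same way as in the snippet above.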

answered Sep 22 '22 19:09

MD. HUMAYUN KABIR TUHIN

If you are using PyCharm and want to run your code line by line instead of submitting your .py file through spark-submit, you can copy the connector .jar to c:\spark\jars\ and then use code like this:

from pyspark.sql import SparkSession

# The connector .jar must already be in Spark's jars directory (c:\spark\jars\).
spark = SparkSession.builder.getOrCreate()
source_df = spark.read.format('jdbc').options(
    url='jdbc:mysql://localhost:3306/database1',
    driver='com.mysql.cj.jdbc.Driver',  # or com.mysql.jdbc.Driver for older connectors
    dbtable='table1',
    user='root',
    password='****').load()
print(source_df)
source_df.show()
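
The two driver class names that appear in this answer depend on the connector version: Connector/J 8.x ships com.mysql.cj.jdbc.Driver, while 5.1.x ships com.mysql.jdbc.Driver. A small sketch of keeping that choice in one place; mysql_jdbc_options and its legacy_driver flag are hypothetical names used only for illustration:

```python
def mysql_jdbc_options(database, table, user, password,
                       host="localhost", port=3306, legacy_driver=False):
    """Hypothetical helper: build the option dict for spark.read.format('jdbc')."""
    # Connector/J 8.x ships com.mysql.cj.jdbc.Driver; 5.1.x ships com.mysql.jdbc.Driver.
    driver = "com.mysql.jdbc.Driver" if legacy_driver else "com.mysql.cj.jdbc.Driver"
    return {
        "url": "jdbc:mysql://{}:{}/{}".format(host, port, database),
        "driver": driver,
        "dbtable": table,
        "user": user,
        "password": password,
    }
```

With a session named spark, the dict can be unpacked into the reader: spark.read.format("jdbc").options(**mysql_jdbc_options("database1", "table1", "root", "****")).load()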

answered Sep 24 '22 19:09

Feilong Wang
