Failed to find data source: com.mongodb.spark.sql.DefaultSource

Question

I'm trying to connect Spark (PySpark) to MongoDB as follows:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext

conf = SparkConf()
conf.set('spark.mongodb.input.uri', default_mongo_uri)
conf.set('spark.mongodb.output.uri', default_mongo_uri)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
spark = SparkSession \
    .builder \
    .appName("my-app") \
    .config("spark.mongodb.input.uri", default_mongo_uri) \
    .config("spark.mongodb.output.uri", default_mongo_uri) \
    .getOrCreate()

But when I do the following:

users = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
        .option("uri", '{uri}.{col}'.format(uri=mongo_uri, col='users')).load()

I get this error:

java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource

I did the same thing from the pyspark shell and was able to retrieve data. This is the command I ran:

pyspark --conf "spark.mongodb.input.uri=mongodb_uri" --conf "spark.mongodb.output.uri=mongodburi" --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.2

In the shell we have the option to specify the package to use. But what about standalone apps and scripts? How can I configure mongo-spark-connector there?

Any ideas?

Konstantin K. · Accepted Answer

Here is how I did it in a Jupyter notebook:
1. Download the jars from Maven Central or any other repository and put them in a directory called "jars":
mongo-spark-connector_2.11-2.4.0
mongo-java-driver-3.9.0
2. Create a session and write/read any data:

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

working_directory = 'jars/*'

my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.myCollection") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.myCollection") \
    .config('spark.driver.extraClassPath', working_directory) \
    .getOrCreate()

people = my_spark.createDataFrame([("JULIA", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77),
                            ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", 22)], ["name", "age"])

people.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()

df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.select('*').where(col("name") == "JULIA").show()

As a result, the last line prints the rows whose name is JULIA.
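
The session above only finds the data source if both jars are actually present in the "jars" directory. A quick check before starting Spark can rule that out as the cause of the ClassNotFoundException. This is a minimal sketch and not part of the original answer; the helper name `missing_jars` and the prefix list are assumptions chosen for illustration.

```python
from pathlib import Path


def missing_jars(jar_dir, required_prefixes):
    """Return the prefixes for which no .jar file exists in jar_dir."""
    jar_names = [p.name for p in Path(jar_dir).glob("*.jar")]
    return [
        prefix
        for prefix in required_prefixes
        if not any(name.startswith(prefix) for name in jar_names)
    ]


if __name__ == "__main__":
    # Prefixes of the two jars named in the steps above (hypothetical check).
    required = ["mongo-spark-connector_2.11", "mongo-java-driver"]
    missing = missing_jars("jars", required)
    if missing:
        print("Missing jars:", ", ".join(missing))
    else:
        print("All required jars found.")
```

Matching on prefixes rather than full file names keeps the check independent of the exact connector and driver versions that were downloaded.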

Pranjal Gharat · Answer

If you are using both SparkContext and SparkSession, you have to specify the connector package in SparkConf via spark.jars.packages. Check the following code:

    from pyspark import SparkContext,SparkConf
    conf = SparkConf().set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.3.2")
    sc = SparkContext(conf=conf)

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://xxx.xxx.xxx.xxx:27017/sample1.zips") \
    .config("spark.mongodb.output.uri", "mongodb://xxx.xxx.xxx.xxx:27017/sample1.zips") \
    .getOrCreate()

    df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
    df.printSchema()

If you are using only SparkSession, then use the following code:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://xxx.xxx.xxx.xxx:27017/sample1.zips") \
    .config("spark.mongodb.output.uri", "mongodb://xxx.xxx.xxx.xxx:27017/sample1.zips") \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.3.2') \
    .getOrCreate()

    df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
    df.printSchema()
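
The `_2.11` suffix in the package coordinate is the Scala binary version the connector was built for, and it has to match the Scala version of the Spark build in use. The sketch below assembles the coordinate from its parts and, as an alternative for plain Python scripts, passes it through the `PYSPARK_SUBMIT_ARGS` environment variable. The helper names `connector_coordinate` and `submit_args` are hypothetical, and the sketch assumes the variable is set before the first SparkContext or SparkSession is created in the process.

```python
import os


def connector_coordinate(scala_version, connector_version):
    """Build the Maven coordinate of the MongoDB Spark connector."""
    return "org.mongodb.spark:mongo-spark-connector_{}:{}".format(
        scala_version, connector_version
    )


def submit_args(coordinate):
    """Build a PYSPARK_SUBMIT_ARGS value; it has to end with 'pyspark-shell'."""
    return "--packages {} pyspark-shell".format(coordinate)


coordinate = connector_coordinate("2.11", "2.3.2")
# Assumption: no Spark JVM has been started yet in this process,
# otherwise the setting is read too late to have any effect.
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args(coordinate)
print(coordinate)
```

Either way, the package has to be known before the JVM starts, which is why the setting goes into the configuration or the environment rather than being applied to a session that already exists.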

Ankit Vohra · Answer

I was facing the same error, "java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource", while trying to connect to MongoDB from Spark (2.3).

I had to download the mongo-spark-connector_2.11 JAR file(s) and copy them into the jars directory of the Spark installation.

That resolved my issue, and I was then able to run my Spark code via spark-submit.

Hope it helps.
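
The copy step can also be scripted. This is a rough sketch under stated assumptions: the helper name `install_jar` and the jar file name are hypothetical, and `SPARK_HOME` is assumed to point at the Spark installation. The MongoDB Java driver jar listed in the accepted answer may need to be copied the same way.

```python
import os
import shutil
from pathlib import Path


def install_jar(jar_path, spark_home):
    """Copy a jar into <spark_home>/jars and return the destination path."""
    jars_dir = Path(spark_home) / "jars"
    if not jars_dir.is_dir():
        raise FileNotFoundError("no jars directory under {}".format(spark_home))
    destination = jars_dir / Path(jar_path).name
    shutil.copy2(jar_path, destination)
    return destination


if __name__ == "__main__":
    # Hypothetical file name of a previously downloaded connector jar.
    jar = Path("mongo-spark-connector_2.11-2.3.2.jar")
    spark_home = os.environ.get("SPARK_HOME")
    if spark_home and jar.is_file():
        print(install_jar(jar, spark_home))
```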
