
Getting Spark, Java, and MongoDB to work together

Similar to my question here, but this time it's Java, not Python, that's causing me problems.

I have followed the steps advised (to the best of my knowledge) here, but since I'm using hadoop-2.6.1 I think I should be using the old API rather than the new API referred to in the example.

I'm working on Ubuntu and the various component versions I have are

  • Spark spark-1.5.1-bin-hadoop2.6
  • Hadoop hadoop-2.6.1
  • Mongo 3.0.8
  • Mongo-Hadoop connector jars included via Maven
  • Java 1.8.0_66
  • Maven 3.0.5

My Java program is basic

import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import com.mongodb.hadoop.MongoInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.bson.BSONObject;

public class SimpleApp {
  public static void main(String[] args) {
    Configuration mongodbConfig = new Configuration();
    mongodbConfig.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
    mongodbConfig.set("mongo.input.uri", "mongodb://localhost:27017/db.collection");
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaPairRDD<Object, BSONObject> documents = sc.newAPIHadoopRDD(
        mongodbConfig,            // Configuration
        MongoInputFormat.class,   // InputFormat: read from a live cluster.
        Object.class,             // Key class
        BSONObject.class          // Value class
    );
  }
}

It builds fine using Maven (mvn package) with the following pom file

<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>edu.berkeley</groupId>
  <artifactId>simple-project</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>
  <name>Simple Project</name>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.5.1</version>
    </dependency>
    <dependency>
        <groupId>org.mongodb</groupId>
        <artifactId>mongo-java-driver</artifactId>
        <version>3.2.0</version>
    </dependency>
    <dependency>
      <groupId>org.mongodb.mongo-hadoop</groupId>
      <artifactId>mongo-hadoop-core</artifactId>
      <version>1.4.2</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
    </plugins>
  </build>
</project>

I then submit the jar

/usr/local/share/spark-1.5.1-bin-hadoop2.6/bin/spark-submit --class "SimpleApp" --master local[4] target/simple-project-1.0.jar

and get the following error

Exception in thread "main" java.lang.NoClassDefFoundError: com/mongodb/hadoop/MongoInputFormat
    at SimpleApp.main(SimpleApp.java:18)

NOTICE

I edited this question on 18 December as it had grown far too confusing and verbose, so previous comments might look irrelevant. The context of the question, however, is the same.

asked by Philip O'Brien

1 Answer

I faced the same problems, but after a lot of trial and error I got it working with this code. I'm running a Maven project with NetBeans on Ubuntu and Java 7. Hope this helps.

Include the maven-shade-plugin if there are any conflicts between classes; a minimal sketch is shown below.

P.S.: I don't know the exact cause of your particular error, but I have faced plenty like it, and this code runs perfectly for me.
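A minimal shade-plugin sketch for the <build><plugins> section might look like the following (the plugin version here is just an example); the resulting uber-jar bundles the mongo-hadoop classes, so spark-submit can find com.mongodb.hadoop.MongoInputFormat at runtime:

    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.4.1</version>
        <executions>
            <execution>
                <phase>package</phase>
                <goals>
                    <goal>shade</goal>
                </goals>
            </execution>
        </executions>
    </plugin>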

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>1.5.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>1.5.1</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.14</version>
        </dependency>
        <dependency>
            <groupId>org.mongodb.mongo-hadoop</groupId>
            <artifactId>mongo-hadoop-core</artifactId>
            <version>1.4.1</version>
        </dependency>
    </dependencies>

Java code

    // Hadoop configuration pointing the mongo-hadoop connector at the source collection
    Configuration conf = new Configuration();
    conf.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
    conf.set("mongo.input.uri", "mongodb://localhost:27017/databasename.collectionname");

    SparkConf sconf = new SparkConf().setMaster("local").setAppName("Spark UM Jar");
    JavaSparkContext sc = new JavaSparkContext(sconf);

    JavaRDD<User> UserMaster = sc.newAPIHadoopRDD(conf, MongoInputFormat.class, Object.class, BSONObject.class)
            .map(new Function<Tuple2<Object, BSONObject>, User>() {
                @Override
                public User call(Tuple2<Object, BSONObject> v1) throws Exception {
                    // map the BSONObject to a User and return it
                }
            });
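As a quick sanity check (just a sketch; the log message is illustrative), a trivial action on the RDD forces the read from MongoDB:

    long count = UserMaster.count();
    System.out.println("Loaded " + count + " documents from MongoDB");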
answered by Anil