What is the most efficient way of moving data out of Hive and into MongoDB?

Is there an elegant, easy and fast way to move data out of Hive into MongoDB?

asked Sep 09 '12 by Alex N.


3 Answers

You can do the export with the Hadoop-MongoDB connector. Run the Hive query in your job's main method to write its result to an HDFS directory; the mapper then reads that output and inserts the records into MongoDB.

Example:

Here I'm loading a semicolon-separated text file (id;firstname;lastname) into a MongoDB collection using a simple Hive query:

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.io.BSONWritable;
import com.mongodb.hadoop.util.MongoConfigUtil;

public class HiveToMongo extends Configured implements Tool {

    private static class HiveToMongoMapper extends
            Mapper<LongWritable, Text, IntWritable, BSONWritable> {

        //See: https://issues.apache.org/jira/browse/HIVE-634
        private static final String HIVE_EXPORT_DELIMITER = '\001' + "";
        private IntWritable k = new IntWritable();
        private BSONWritable v = null;

        @Override
        public void map(LongWritable key, Text value, Context context) 
          throws IOException, InterruptedException {

            String[] split = value.toString().split(HIVE_EXPORT_DELIMITER);

            k.set(Integer.parseInt(split[0]));
            v = new BSONWritable();
            v.put("firstname", split[1]);
            v.put("lastname", split[2]);
            context.write(k, v);

        }
    }

    public static void main(String[] args) throws Exception {
        try {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        }
        catch (ClassNotFoundException e) {
            System.out.println("Unable to load Hive Driver");
            System.exit(1);
        }

        try {
            Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default");

            Statement stmt = con.createStatement();    
            String sql = "INSERT OVERWRITE DIRECTORY " +
                    "'hdfs://localhost:8020/user/hive/tmp' select * from users";
            stmt.execute(sql);

        }
        catch (SQLException e) {
            e.printStackTrace();
            System.exit(1);
        }

        int res = ToolRunner.run(new Configuration(), new HiveToMongo(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {

        Configuration conf = getConf();
        Path inputPath = new Path("/user/hive/tmp");
        String mongoDbPath = "mongodb://127.0.0.1:6900/mongo_users.mycoll";
        MongoConfigUtil.setOutputURI(conf, mongoDbPath);

        /*
        Add dependencies to distributed cache via 
        DistributedCache.addFileToClassPath(...) :
        - mongo-hadoop-core-x.x.x.jar
        - mongo-java-driver-x.x.x.jar
        - hive-jdbc-x.x.x.jar
        HadoopUtils is an own utility class
        */
        HadoopUtils.addDependenciesToDistributedCache("/libs/mongodb", conf);
        HadoopUtils.addDependenciesToDistributedCache("/libs/hive", conf);

        Job job = new Job(conf, "HiveToMongo");

        FileInputFormat.setInputPaths(job, inputPath);
        job.setJarByClass(HiveToMongo.class);
        job.setMapperClass(HiveToMongoMapper.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(BSONWritable.class);
        job.setNumReduceTasks(0);

        // Wait for completion so the exit code reflects whether the job succeeded
        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }
}
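
To launch it, package the class into a jar and run it with the standard hadoop jar command (the jar name below is just an example); the Hive JDBC and MongoDB driver jars also need to be on the client classpath, e.g. via HADOOP_CLASSPATH, in addition to the copies added to the distributed cache:

hadoop jar hive-to-mongo.jar HiveToMongo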

One drawback is that a 'staging area' (/user/hive/tmp) is needed to store the intermediate Hive output. Furthermore, as far as I know, the Mongo-Hadoop connector doesn't support upserts.
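
If you want to get rid of the staging directory once the job has finished, a minimal sketch using the Hadoop FileSystem API (placed at the end of run(), after the job completes; it needs an org.apache.hadoop.fs.FileSystem import and assumes the same /user/hive/tmp path as above):

// Clean up the intermediate Hive output after the job completes
FileSystem fs = FileSystem.get(conf);
Path staging = new Path("/user/hive/tmp");
if (fs.exists(staging)) {
    fs.delete(staging, true);   // true = recursive delete
}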

I'm not quite sure, but you can also try to fetch the data from Hive without running hiveserver (which exposes a Thrift service), so you can probably save some overhead. Look at the source code of Hive's org.apache.hadoop.hive.cli.CliDriver#processLine(String line, boolean allowInterupting) method, which actually executes the query. Then you can hack together something like this:

...
LogUtils.initHiveLog4j();
CliSessionState ss = new CliSessionState(new HiveConf(SessionState.class));
ss.in = System.in;
ss.out = new PrintStream(System.out, true, "UTF-8");
ss.err = new PrintStream(System.err, true, "UTF-8");
SessionState.start(ss);

Driver qp = new Driver();
processLocalCmd("SELECT * from users", qp, ss); //taken from CliDriver
...
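
As an alternative to copying processLocalCmd, here is a rough sketch of driving the query yourself and writing the rows with the plain MongoDB Java driver (2.x API). It reuses the qp and session set up above; the port and collection names are taken from the earlier example, and it assumes getResults() hands back tab-separated rows, so double-check that against your Hive version:

qp.run("SELECT id, firstname, lastname FROM users");

Mongo mongo = new Mongo("127.0.0.1", 6900);                 // com.mongodb.Mongo, 2.x driver
DBCollection coll = mongo.getDB("mongo_users").getCollection("mycoll");

ArrayList<String> batch = new ArrayList<String>();
while (qp.getResults(batch)) {                              // fetch result rows in batches
    for (String row : batch) {
        String[] f = row.split("\t");
        BasicDBObject doc = new BasicDBObject("firstname", f[1]).append("lastname", f[2]);
        // upsert on _id so re-running the export doesn't create duplicates
        coll.update(new BasicDBObject("_id", Integer.parseInt(f[0])),
                    new BasicDBObject("$set", doc), true, false);
    }
    batch.clear();
}
mongo.close();

A side benefit of writing through the Java driver is that you get upserts, which the Mongo-Hadoop output format doesn't offer.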

Side notes:

There's also a hive-mongo connector implementation you might check. It's also worth having a look at the implementation of the Hive-HBase connector to get some ideas if you want to implement a similar one for MongoDB.
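
For orientation, the storage-handler route boils down to declaring a Hive table backed by a MongoDB collection and then running a plain INSERT into it, which avoids the staging directory entirely. The handler class and property names below are placeholders (they differ between hive-mongo and other connectors), so look up the real ones in the connector's documentation; con is the Hive JDBC connection from the first example:

Statement stmt = con.createStatement();
stmt.execute(
    "CREATE TABLE mongo_users (id INT, firstname STRING, lastname STRING) " +
    "STORED BY 'com.example.hive.MongoStorageHandler' " +                            // placeholder handler class
    "TBLPROPERTIES ('mongo.uri' = 'mongodb://127.0.0.1:6900/mongo_users.mycoll')");  // placeholder property name
stmt.execute("INSERT OVERWRITE TABLE mongo_users SELECT id, firstname, lastname FROM users");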

answered Nov 11 '22 by Lorand Bendig


Have you looked into Sqoop? It's supposed to make it very simple to move data between Hadoop and SQL/NoSQL databases. This article also gives an example of using it with Hive.
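
For reference, Sqoop's export workflow reads files out of an HDFS directory (e.g. a Hive warehouse or staging directory) and pushes the rows to a target database over JDBC, roughly like the command below. MongoDB is not a standard JDBC target, though, so this route would need a MongoDB-specific Sqoop connector; the connection string and table name here are only illustrative:

sqoop export \
  --connect jdbc:mysql://dbhost/mydb \
  --table users \
  --export-dir /user/hive/warehouse/users \
  --input-fields-terminated-by '\001'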

answered Nov 11 '22 by HypnoticSheep


Take a look at the Hadoop-MongoDB connector project:

http://api.mongodb.org/hadoop/MongoDB%2BHadoop+Connector.html

"This connectivity takes the form of allowing both reading MongoDB data into Hadoop (for use in MapReduce jobs as well as other components of the Hadoop ecosystem), as well as writing the results of Hadoop jobs out to MongoDB."

Not sure if it will work for your use case, but it's worth looking at.

answered Nov 11 '22 by Jean-Philippe Bond