I created a job that reads a bunch of JSON files from HDFS and tries to load them into MongoDB. It's a map-only job, since I don't need any additional processing in the reduce step, and I'm trying to use the mongo-hadoop connector.
The mapper script is written in Perl and is deployed to all the nodes in the cluster along with its dependencies. The script emits, in binary mode, a BSON-serialized version of the original JSON file.
For some reason the job fails with the following error:
Error: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to com.mongodb.hadoop.io.BSONWritable
at com.mongodb.hadoop.streaming.io.MongoInputWriter.writeValue(MongoInputWriter.java:10)
at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:106)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Not only that, but I also created a version of the same script in Python using the pymongo_hadoop package, and the job fails with the same error.
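For reference, the Python mapper boils down to roughly the following. This is a minimal sketch using PyMongo's bson module (not the pymongo_hadoop helpers), and it assumes one JSON document per input line, which is what TextInputFormat hands to a streaming mapper; the real mongo_mapper.py does more than this:

#!/usr/bin/env python
# Minimal sketch of the streaming mapper: read JSON documents from stdin
# and write their BSON-encoded bytes to stdout in binary mode.
import json
import sys

import bson  # ships with the PyMongo driver


def main():
    # Write raw bytes; fall back to plain stdout on Python 2.
    out = sys.stdout.buffer if hasattr(sys.stdout, "buffer") else sys.stdout
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)            # assumes one JSON document per line
        out.write(bson.BSON.encode(doc))  # raw BSON, no trailing newline


if __name__ == "__main__":
    main()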
After digging a little deeper into the logs for the failed tasks, I discovered that the actual error is:
2016-06-13 16:13:11,778 INFO [Thread-12] com.mongodb.hadoop.io.BSONWritable: No Length Header available.java.io.EOFException
The problem is that it fails silently: I've added some logging to the mapper, but from what I can tell the mapper doesn't even get called. This is how I'm invoking the job:
yarn jar /usr/hdp/2.4.0.0-169/hadoop-mapreduce/hadoop-streaming.jar \
-libjars /opt/mongo-hadoop/mongo-hadoop-core-1.5.2.jar,/opt/mongo-hadoop/mongo-hadoop-streaming-1.5.2.jar,/opt/mongo-hadoop/mongodb-driver-3.2.2.jar \
-D mongo.output.uri="${MONGODB}" \
-outputformat com.mongodb.hadoop.mapred.MongoOutputFormat \
-jobconf stream.io.identifier.resolver.class=com.mongodb.hadoop.streaming.io.MongoIdentifierResolver \
-io mongodb \
-input "${INPUT_PATH}" \
-output "${OUTPUT_PATH}" \
-mapper "/opt/mongo/mongo_mapper.py"
What am I doing wrong? It seems there's no other way to get data from HDFS into MongoDB...
I wonder why I didn't try this in the first place:
OUTPUT_PATH=`mktemp -d`
yarn jar /usr/hdp/2.4.0.0-169/hadoop-mapreduce/hadoop-streaming.jar \
-D mapred.job.name="${BUILD_TAG}" \
-D mapred.job.queue.name="sr" \
-input "${INPUT_PATH}" \
-output "${OUTPUT_PATH}" \
-mapper "/usr/bin/mongoimport -h ${MONGODB_HOST} -d ${MONGODB_DATABASE} -c ${MONGODB_COLLECTION}"
Guess what? It works like a charm! Hadoop streaming just pipes each input split's JSON lines to mongoimport's stdin, and mongoimport writes them straight to MongoDB, so there's no need to hand-serialize anything to BSON.