
Hadoop streaming with mongo-hadoop connector fails

I created a job that reads a bunch of JSON files from HDFS and tries to load them into MongoDB. There is only a map script, because I don't need any additional processing in the reduce step. I'm trying to use the mongo-hadoop connector.

The script is written in Perl and deployed to all the nodes in the cluster, along with all of its additional dependencies. The script emits, in binary mode, a BSON-serialized version of the original JSON file.
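In essence, the mapper boils down to something like this (a stripped-down sketch in Python rather than Perl, assuming the bson module that ships with pymongo; the real script does more):

    #!/usr/bin/env python
    # Stripped-down sketch of the mapper: read JSON records from stdin and
    # emit each one as raw BSON on a binary stdout. Illustrative only.
    import sys
    import json
    import bson  # ships with pymongo

    out = getattr(sys.stdout, 'buffer', sys.stdout)  # binary stdout on Python 3

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        out.write(bson.BSON.encode(doc))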

For some reason the job fails with the following error:

Error: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to com.mongodb.hadoop.io.BSONWritable
    at com.mongodb.hadoop.streaming.io.MongoInputWriter.writeValue(MongoInputWriter.java:10)
    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:106)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)

Not only that, but I also created a version of the same script in Python using the pymongo-hadoop package. The job fails with the same error.
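The Python version follows the pattern from the pymongo_hadoop examples and looks roughly like this (the pass-through mapper body is a simplification):

    #!/usr/bin/env python
    # Sketch of the pymongo_hadoop-based mapper; the pass-through body is a
    # simplification of what the real script does.
    from pymongo_hadoop import BSONMapper

    def mapper(documents):
        for doc in documents:
            yield doc  # emit each document unchanged

    BSONMapper(mapper)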

After digging a little more into the logs of the failed tasks, I discovered that the actual error is:

2016-06-13 16:13:11,778 INFO [Thread-12] com.mongodb.hadoop.io.BSONWritable: No Length Header available.java.io.EOFException

The problem is that it fails silently. I've added some logging to the mapper, but from what I can tell the mapper doesn't even get called. This is how I'm calling the job:

yarn jar /usr/hdp/2.4.0.0-169/hadoop-mapreduce/hadoop-streaming.jar \
    -libjars /opt/mongo-hadoop/mongo-hadoop-core-1.5.2.jar,/opt/mongo-hadoop/mongo-hadoop-streaming-1.5.2.jar,/opt/mongo-hadoop/mongodb-driver-3.2.2.jar \
    -D mongo.output.uri="${MONGODB}" \
    -outputformat com.mongodb.hadoop.mapred.MongoOutputFormat \
    -jobconf stream.io.identifier.resolver.class=com.mongodb.hadoop.streaming.io.MongoIdentifierResolver \
    -io mongodb \
    -input "${INPUT_PATH}" \
    -output "${OUTPUT_PATH}" \
    -mapper "/opt/mongo/mongo_mapper.py"

What am I doing wrong? It seems there's no other way to get data from HDFS into MongoDB...

1 Answer

I wonder why I didn't try this in the first place:

OUTPUT_PATH=`mktemp -d`

yarn jar /usr/hdp/2.4.0.0-169/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapred.job.name="${BUILD_TAG}" \
    -D mapred.job.queue.name="sr" \
    -input "${INPUT_PATH}" \
    -output "${OUTPUT_PATH}" \
    -mapper "/usr/bin/mongoimport -h ${MONGODB_HOST} -d ${MONGODB_DATABASE} -c ${MONGODB_COLLECTION}"

Guess what? It works like a charm!
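Since mongoimport reads JSON documents from its standard input by default, each mapper simply pipes the records it receives straight into the target collection, and the connector jars, the custom output format and the identifier resolver are no longer involved.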
