I created a job that reads a bunch of JSON files from HDFS and tries to load them into MongoDB. It's a map-only job, since I don't need any additional processing in the reduce step, and I'm trying to use the mongo-hadoop connector.
The mapper script is written in Perl and is deployed to all the nodes in the cluster along with its dependencies. The script emits, in binary mode, a BSON-serialized version of the original JSON file.
For some reason the job fails with the following error:
Error: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to com.mongodb.hadoop.io.BSONWritable
at com.mongodb.hadoop.streaming.io.MongoInputWriter.writeValue(MongoInputWriter.java:10)
at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:106)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Not only that, but I also created a version of the same script in Python using the pymongo_hadoop package, and the job fails with the same error.
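For reference, the Python mapper boils down to roughly the following. This is a minimal sketch using PyMongo's bson module (not the pymongo_hadoop helpers), and it assumes one JSON document per input line, which is what TextInputFormat hands to a streaming mapper; the real mongo_mapper.py does more than this:

#!/usr/bin/env python
# Minimal sketch of the streaming mapper: read JSON documents from stdin
# and write their BSON-encoded bytes to stdout in binary mode.
import json
import sys

import bson  # ships with the PyMongo driver


def main():
    # Write raw bytes; fall back to plain stdout on Python 2.
    out = sys.stdout.buffer if hasattr(sys.stdout, "buffer") else sys.stdout
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)            # assumes one JSON document per line
        out.write(bson.BSON.encode(doc))  # raw BSON, no trailing newline


if __name__ == "__main__":
    main()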
After digging a little deeper into the logs for the failed tasks, I discovered that the actual error is:
2016-06-13 16:13:11,778 INFO [Thread-12] com.mongodb.hadoop.io.BSONWritable: No Length Header available.java.io.EOFException
The problem is that it fails silently: I've added some logging to the mapper, but from what I can tell the mapper doesn't even get called. This is how I'm invoking the job:
yarn jar /usr/hdp/2.4.0.0-169/hadoop-mapreduce/hadoop-streaming.jar \
-libjars /opt/mongo-hadoop/mongo-hadoop-core-1.5.2.jar,/opt/mongo-hadoop/mongo-hadoop-streaming-1.5.2.jar,/opt/mongo-hadoop/mongodb-driver-3.2.2.jar \
-D mongo.output.uri="${MONGODB}" \
-outputformat com.mongodb.hadoop.mapred.MongoOutputFormat \
-jobconf stream.io.identifier.resolver.class=com.mongodb.hadoop.streaming.io.MongoIdentifierResolver \
-io mongodb \
-input "${INPUT_PATH}" \
-output "${OUTPUT_PATH}" \
-mapper "/opt/mongo/mongo_mapper.py"
What am I doing wrong? It seems there's no other way to get data from HDFS into MongoDB...
I wonder why I didn't try this in the first place:
OUTPUT_PATH=`mktemp -d`
yarn jar /usr/hdp/2.4.0.0-169/hadoop-mapreduce/hadoop-streaming.jar \
-D mapred.job.name="${BUILD_TAG}" \
-D mapred.job.queue.name="sr" \
-input "${INPUT_PATH}" \
-output "${OUTPUT_PATH}" \
-mapper "/usr/bin/mongoimport -h ${MONGODB_HOST} -d ${MONGODB_DATABASE} -c ${MONGODB_COLLECTION}"
Guess what? It works like a charm! Hadoop streaming just pipes each input split's JSON lines to mongoimport's stdin, and mongoimport writes them straight to MongoDB, so there's no need to hand-serialize anything to BSON.