I am processing a bunch of avro files which are stored in a nested directory structure in HDFS. The files are stored in year/month/day/hour format directory structure.
I wrote this simple code to process
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
val rootDir = "/user/cloudera/rootDir"
val rdd1 = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](rootDir)
rdd1.count()
I get an exception which I have pasted below. The biggest problem I am facing is that it doesn't tell me which file is not a data file. So I will have to go in HDFS and scan through 1000s of files to see which one was not a data file.
is there a more efficient way to debug/solve this?
5/11/01 19:01:49 WARN TaskSetManager: Lost task 1084.0 in stage 14.0 (TID 11562, datanode): java.io.IOException: Not a data file.
at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:102)
at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
at org.apache.avro.mapreduce.AvroRecordReaderBase.createAvroFileReader(AvroRecordReaderBase.java:183)
at org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:94)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
One of the nodes on your cluster where the block is located are down. The data is not found because of that, which gives the error. The solution is to repair and bring up all the nodes in the cluster.
I was getting the exact error below with my Java map reduce program that uses avro input. Below is a rundown of the issue.
Error: java.io.IOException: Not a data file. at
org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:102)
at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
at org.apache.avro.mapreduce.AvroRecordReaderBase.createAvroFileReader(AvroRecordReaderBase.java:183) at
org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:94) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) at
org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168) at
java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
I decided to cat the file because I was able to run the program over another file in the same folder of HDFS and receive the following.
INFO hdfs.DFSClient: No node available for <Block location in your
cluster> from any node: java.io.IOException: No live nodes contain
block BP-6168826450-10.1.10.123-1457116155679:blk_1073853378_112574
after checking nodes = [], ignoredNodes = null No live nodes contain
current block Block locations: Dead nodes: . Will get new block
locations from namenode and retry...
We have been having some problems with our cluster and unfortunately some nodes were down. After remedy of the problem this error was resolved
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With