Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

java.io.IOException: Not a data file

I am processing a bunch of avro files which are stored in a nested directory structure in HDFS. The files are stored in year/month/day/hour format directory structure.

I wrote this simple code to process

sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
val rootDir = "/user/cloudera/rootDir"
val rdd1 = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](rootDir)
rdd1.count()

I get an exception which I have pasted below. The biggest problem I am facing is that it doesn't tell me which file is not a data file. So I will have to go in HDFS and scan through 1000s of files to see which one was not a data file.

is there a more efficient way to debug/solve this?

5/11/01 19:01:49 WARN TaskSetManager: Lost task 1084.0 in stage 14.0 (TID 11562, datanode): java.io.IOException: Not a data file.
    at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:102)
    at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
    at org.apache.avro.mapreduce.AvroRecordReaderBase.createAvroFileReader(AvroRecordReaderBase.java:183)
    at org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:94)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
like image 533
Knows Not Much Avatar asked Nov 01 '15 19:11

Knows Not Much


1 Answers

One of the nodes on your cluster where the block is located are down. The data is not found because of that, which gives the error. The solution is to repair and bring up all the nodes in the cluster.

I was getting the exact error below with my Java map reduce program that uses avro input. Below is a rundown of the issue.

Error: java.io.IOException: Not a data file.    at
org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:102)
at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
at org.apache.avro.mapreduce.AvroRecordReaderBase.createAvroFileReader(AvroRecordReaderBase.java:183)   at
org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:94) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)   at
org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)    at
 java.security.AccessController.doPrivileged(Native Method)     at javax.security.auth.Subject.doAs(Subject.java:422)   at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)

I decided to cat the file because I was able to run the program over another file in the same folder of HDFS and receive the following.

INFO hdfs.DFSClient: No node available for <Block location in your
cluster> from any node: java.io.IOException: No live nodes contain
 block BP-6168826450-10.1.10.123-1457116155679:blk_1073853378_112574
 after checking nodes = [], ignoredNodes = null No live nodes contain
 current block Block locations: Dead nodes: . Will get new block
 locations from namenode and retry...

We have been having some problems with our cluster and unfortunately some nodes were down. After remedy of the problem this error was resolved

like image 193
SparkleGoat Avatar answered Sep 20 '22 20:09

SparkleGoat