Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does spark-shell throw ArrayIndexOutOfBoundsException when reading a large file from HDFS?

Tags:

apache-spark

I am using hadoop 2.4.1 and Spark 1.1.0. I have uploaded a dataset of food review to HDFS from here and then I used the following code to read the file and process it on the spark shell:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

var path = "hdfs:///user/hduser/finefoods.txt"
val conf = new Configuration
conf.set("textinputformat.record.delimiter", "\n\n")
var dataset = sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf).map(_._2.toString)
var datasetObj = dataset.map{ rowStr => rowStr.split("\n")}    
var tupleSet = datasetObj.map( strArr => strArr.map( elm => elm.split(": ")(1))).map( arr => (arr(0),arr(1),arr(4).toDouble))
tupleSet.groupBy(t => t._2)

When I run the last line tupleSet.groupBy(t => t._2), the spark shell throws the following exception:

scala> tupleSet.groupBy( t => t._2).first()
14/11/15 22:46:59 INFO spark.SparkContext: Starting job: first at <console>:28
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Registering RDD 11 (groupBy at <console>:28)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Got job 1 (first at <console>:28) with 1 output partitions (allowLocal=true)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Final stage: Stage 1(first at <console>:28)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Parents of final stage: List(Stage 2)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Missing parents: List(Stage 2)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Submitting Stage 2 (MappedRDD[11] at groupBy at <console>:28), which has no missing parents
14/11/15 22:46:59 INFO storage.MemoryStore: ensureFreeSpace(3592) called with curMem=221261, maxMem=278302556
14/11/15 22:46:59 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.5 KB, free 265.2 MB)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from Stage 2 (MappedRDD[11] at groupBy at <console>:28)
14/11/15 22:46:59 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 3 tasks
14/11/15 22:46:59 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 3, localhost, ANY, 1221 bytes)
14/11/15 22:46:59 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 2.0 (TID 4, localhost, ANY, 1221 bytes)
14/11/15 22:46:59 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 3)
14/11/15 22:46:59 INFO executor.Executor: Running task 1.0 in stage 2.0 (TID 4)
14/11/15 22:46:59 INFO rdd.NewHadoopRDD: Input split: hdfs://10.12.0.245/user/hduser/finefoods.txt:0+134217728
14/11/15 22:46:59 INFO rdd.NewHadoopRDD: Input split: hdfs://10.12.0.245/user/hduser/finefoods.txt:134217728+134217728
14/11/15 22:47:02 ERROR executor.Executor: Exception in task 1.0 in stage 2.0 (TID 4)
java.lang.ArrayIndexOutOfBoundsException
14/11/15 22:47:02 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 2.0 (TID 5, localhost, ANY, 1221 bytes)
14/11/15 22:47:02 INFO executor.Executor: Running task 2.0 in stage 2.0 (TID 5)
14/11/15 22:47:02 INFO rdd.NewHadoopRDD: Input split: hdfs://10.12.0.245/user/hduser/finefoods.txt:268435456+102361028
14/11/15 22:47:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 2.0 (TID 4, localhost): java.lang.ArrayIndexOutOfBoundsException:

14/11/15 22:47:02 ERROR scheduler.TaskSetManager: Task 1 in stage 2.0 failed 1 times; aborting job
14/11/15 22:47:02 INFO scheduler.TaskSchedulerImpl: Cancelling stage 2
14/11/15 22:47:02 INFO scheduler.TaskSchedulerImpl: Stage 2 was cancelled
14/11/15 22:47:02 INFO executor.Executor: Executor is trying to kill task 2.0 in stage 2.0 (TID 5)
14/11/15 22:47:02 INFO executor.Executor: Executor is trying to kill task 0.0 in stage 2.0 (TID 3)
14/11/15 22:47:02 INFO scheduler.DAGScheduler: Failed to run first at <console>:28
14/11/15 22:47:02 INFO executor.Executor: Executor killed task 0.0 in stage 2.0 (TID 3)
14/11/15 22:47:02 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 2.0 (TID 3, localhost): TaskKilled (killed intentionally)
14/11/15 22:47:02 INFO executor.Executor: Executor killed task 2.0 in stage 2.0 (TID 5)
14/11/15 22:47:02 WARN scheduler.TaskSetManager: Lost task 2.0 in stage 2.0 (TID 5, localhost): TaskKilled (killed intentionally)
14/11/15 22:47:02 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 4, localhost): java.lang.ArrayIndexOutOfBoundsException:

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

But when I use dummy dataset like the following, it works well:

var tupleSet = sc.parallelize(List(
("B001E4KFG0","A3SGXH7AUHU8GW",3.0),
("B001E4KFG1","A3SGXH7AUHU8GW",4.0),
("B001E4KFG2","A3SGXH7AUHU8GW",4.0),
("B001E4KFG3","A3SGXH7AUHU8GW",4.0),
("B001E4KFG4","A3SGXH7AUHU8GW",5.0),
("B001E4KFG5","A3SGXH7AUHU8GW",5.0),
("B001E4KFG0","bbb",5.0)
))

Any idea why?

like image 737
Kitto Avatar asked Nov 15 '14 15:11

Kitto


People also ask

Can spark read and write to HDFS?

Though Spark supports to read from/write to files on multiple file systems like Amazon S3, Hadoop HDFS, Azure, GCP e.t.c, the HDFS file system is mostly used at the time of writing this article. Also, like any other file system, we can read and write TEXT, CSV, Avro, Parquet and JSON files into HDFS.

How do I read a CSV file from spark dataframe?

In Spark CSV/TSV files can be read in using spark.read.csv ("path"), replace the path to HDFS. spark. read. csv ("hdfs://nn1home:8020/file.csv") And Write a CSV file to HDFS using below syntax. Use the write () method of the Spark DataFrameWriter object to write Spark DataFrame to a CSV file.

Is it possible to read data from local file system in spark?

Few points on using Local File System to read data in Spark – Local File system is not Distributed in Nature. Note the file/directory you are accessing has to be available on each node. Hence is not an Ideal Option to read file in Big Data.

What types of data can be read and written in HDFS?

Also, like any other file system, we can read and write TEXT, CSV, Avro, Parquet and JSON files into HDFS. Spark RDD natively supports reading text files and later with DataFrame, Spark added different data sources like CSV, JSON, Avro, and Parquet.


1 Answers

There's probably an entry in the dataset that does not follow the format and therefore: elm.split(": ")(1) fails, because there's no element at that index.

You can avoid that error by checking the results of the split before accessing the (1) index. One way of doing that could be something like this:

var tupleSet = datasetObj.map(elem => elm.split(": ")).collect{case x if (x.length>1) x(1)} 

One note: Your examples do not seem to match the parsing pipeline in the code. They do not contain the ": " tokens.

Since the transformations are lazy Spark won't tell you much about your input dataset (and you may not notice it) only until executing an action like groupBy().

like image 91
maasg Avatar answered Oct 18 '22 15:10

maasg