Write and read raw byte arrays in Spark using SequenceFile

How do you write RDD[Array[Byte]] to a file using Apache Spark and read it back again?

samthebest asked Jun 06 '14 13:06



1 Answer

One common problem is a weird "cannot cast" exception from BytesWritable to NullWritable. Another is that BytesWritable.getBytes is badly misleading: it does not return just your bytes. It returns the whole backing buffer, so you get your bytes followed by a pile of padding (usually zeros) on the end! Only the first getLength() bytes are valid. You have to use copyBytes, which returns a new array containing exactly the valid bytes.

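import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.io.compress.CompressionCodec
import org.apache.spark.rdd.RDD

// sc is assumed to be an existing SparkContext
val rdd: RDD[Array[Byte]] = ???
// Optional compression codec (assumed here: None, i.e. uncompressed)
val codecOpt: Option[Class[_ <: CompressionCodec]] = None

// To write: pair each byte array with a NullWritable key
rdd.map(bytesArray => (NullWritable.get(), new BytesWritable(bytesArray)))
  .saveAsSequenceFile("/output/path", codecOpt)

// To read: copyBytes() drops the buffer padding that getBytes would include
val loaded: RDD[Array[Byte]] = sc.sequenceFile[NullWritable, BytesWritable]("/input/path")
  .map(_._2.copyBytes())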
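
To see the getBytes problem without running a Spark job, here is a minimal sketch that only needs the Hadoop client library on the classpath. The object name GetBytesVsCopyBytes and the four-byte payload are made up for illustration. It fills an empty BytesWritable through set, which (as far as I can tell) grows the buffer the same way deserialization does, and then compares the array lengths. The exact buffer capacity depends on the Hadoop version, so no particular number is promised here.

```scala
import org.apache.hadoop.io.BytesWritable

// Hypothetical demo object, not part of Spark or Hadoop
object GetBytesVsCopyBytes {
  def main(args: Array[String]): Unit = {
    val payload = Array[Byte](1, 2, 3, 4)

    // Filling an empty writable makes it allocate a buffer that can be
    // larger than the data, so capacity may exceed the valid length.
    val writable = new BytesWritable()
    writable.set(payload, 0, payload.length)

    val backing = writable.getBytes   // whole backing buffer, may be padded
    val valid   = writable.copyBytes  // new array of exactly getLength bytes

    println(s"getLength        = ${writable.getLength}")
    println(s"getBytes.length  = ${backing.length}")
    println(s"copyBytes.length = ${valid.length}")

    // Manual trim, in case copyBytes is not available in your Hadoop version
    val trimmed = java.util.Arrays.copyOfRange(backing, 0, writable.getLength)
    assert(trimmed.sameElements(valid))
  }
}
```

If you must use getBytes (for example to avoid the extra copy), always pair it with getLength and ignore everything past that index.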
samthebest answered Sep 19 '22 23:09
