How can I load Avros in Spark using the schema on-board the Avro file(s)?

Tags: scala, apache-spark, hadoop, avro

I am running CDH 4.4 with Spark 0.9.0 from a Cloudera parcel.

I have a bunch of Avro files that were created via Pig's AvroStorage UDF. I want to load these files in Spark using either generic records or the schema embedded in the Avro files. So far I've tried this:

import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.commons.lang.StringEscapeUtils.escapeCsv

import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.conf.Configuration
import java.net.URI
import java.io.BufferedInputStream
import java.io.File
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.specific.SpecificDatumReader
import org.apache.avro.file.DataFileStream
import org.apache.avro.io.DatumReader
import org.apache.avro.file.DataFileReader
import org.apache.avro.mapred.FsInput

val input = "hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-00016.avro"
val inURI = new URI(input)
val inPath = new Path(inURI)

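// Open the file on HDFS and read it with the schema stored in its header
val fsInput = new FsInput(inPath, sc.hadoopConfiguration)
val reader = new GenericDatumReader[GenericRecord]
val dataFileReader = DataFileReader.openReader(fsInput, reader)
val schema = dataFileReader.getSchema // an org.apache.avro.Schema, not a String

// Pull every record into a local buffer on the driver, then hand it to Spark
val buf = scala.collection.mutable.ListBuffer.empty[GenericRecord]
while (dataFileReader.hasNext) {
  buf += dataFileReader.next
}
sc.parallelize(buf)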

This works for one file, but it doesn't scale: it loads all the data into local RAM on the driver and then distributes it across the Spark nodes from there.

asked May 29 '14 23:05

rjurney

2 Answers

To answer my own question:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable

// spark-shell -usejavacp -classpath "*.jar"

val input = "hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-00016.avro"

// Read through the "mapred" Avro input format: each key is an AvroWrapper around a
// GenericRecord decoded with the schema stored in the file; each value is a NullWritable.
val rdd = sc.hadoopFile(
  input,
  classOf[AvroInputFormat[GenericRecord]],
  classOf[AvroWrapper[GenericRecord]],
  classOf[NullWritable],
  10) // minimum number of partitions
val f1 = rdd.first
val a = f1._1.datum // unwrap the GenericRecord
a.get("rawLog") // Access avro fields
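
One caveat worth sketching: Hadoop record readers generally reuse the same wrapper object for every record, and Avro's generic records are not Java-serializable, so it tends to be safer to pull the fields you need into plain values on the executors before caching or collecting. The snippet below is only an illustration of that idea, assuming the spark-shell's `sc`, the `input` path from above, and the `rawLog` field; `records` and `rawLogs` are names made up for this example.

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable

// Assumes `sc` (from spark-shell) and `input` (the Avro path defined above).
val records = sc.hadoopFile(
  input,
  classOf[AvroInputFormat[GenericRecord]],
  classOf[AvroWrapper[GenericRecord]],
  classOf[NullWritable])

// Convert each record to a plain String on the executors.
// Option(...) guards against a null field value.
val rawLogs = records.map { case (wrapper, _) =>
  Option(wrapper.datum.get("rawLog")).map(_.toString).getOrElse("")
}

// Strings are serializable, so they can be cached or brought back to the driver.
rawLogs.take(5).foreach(println)
```

The same pattern extends to several fields by mapping each record to a tuple or case class of plain values.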

answered Sep 18 '22 15:09

rjurney

This works for me:

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable

...
val path = "hdfs:///path/to/your/avro/folder"
val avroRDD = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable, AvroInputFormat[GenericRecord]](path)
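
As a rough follow-up sketch (not part of the original answer), the schema stored in the files can be inspected from the loaded records themselves, since each `GenericRecord` carries the schema it was decoded with. This assumes the `avroRDD` defined above and a non-empty folder; `fieldNames` is a hypothetical name used only here.

```scala
import scala.collection.JavaConverters._

// Assumes `avroRDD` from the snippet above.
// Read the field names from the schema attached to the first record.
// The result is a plain String, so it can be returned to the driver.
val fieldNames = avroRDD.map { case (wrapper, _) =>
  wrapper.datum.getSchema.getFields.asScala.map(_.name).mkString(", ")
}.first

println(fieldNames)
```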

answered Sep 19 '22 15:09

dexter
