Spark: Writing to Avro file

Tags: serialization, scala, apache-spark, avro

I am using Spark and have an RDD loaded from an Avro file. I now want to apply some transformations to that RDD and save the result back as an Avro file:

val job = new Job(new Configuration())
AvroJob.setOutputKeySchema(job, getOutputSchema(inputSchema))

rdd.map(elem => (new SparkAvroKey(doTransformation(elem._1)), elem._2))
  .saveAsNewAPIHadoopFile(
    outputPath,
    classOf[AvroKey[GenericRecord]],
    classOf[org.apache.hadoop.io.NullWritable],
    classOf[AvroKeyOutputFormat[GenericRecord]],
    job.getConfiguration)

When I run this, Spark complains that Schema$RecordSchema is not serializable.

If I comment out the .map call (and just call rdd.saveAsNewAPIHadoopFile), the call succeeds.

What am I doing wrong here?

Any idea?

asked Dec 16 '13 13:12

user1013725

1 Answer

The issue here is related to the non-serializability of the avro.Schema class used in the Job. The exception is thrown when you try to reference the schema object from the code inside the map function.

For instance, if you try to do as follows, you will get the "Task not serializable" exception:

val schema = new Schema.Parser().parse(new File(jsonSchema))
...
rdd.map(t => {
  // reference to the schema object declared outside
  val record = new GenericData.Record(schema)
})

You can make everything work by creating a new instance of the schema inside the function block:

val schema = new Schema.Parser().parse(new File(jsonSchema))
// The schema above should not be used in closures, it's for other purposes
...
rdd.map(t => {
  // create a new Schema object
  val innerSchema = new Schema.Parser().parse(new File(jsonSchema))
  val record = new GenericData.Record(innerSchema)
  ...
})

Since you probably do not want to parse the Avro schema for every record you handle, a better solution is to parse the schema once per partition. The following also works:

val schema = new Schema.Parser().parse(new File(jsonSchema))
// The schema above should not be used in closures, it's for other purposes
...
rdd.mapPartitions(tuples => {
  // create a new Schema object, once per partition
  val innerSchema = new Schema.Parser().parse(new File(jsonSchema))

  tuples.map(t => {
    val record = new GenericData.Record(innerSchema)
    ...
    // this closure will be bundled together with the outer one 
    // (no serialization issues)
  })
})
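
A variant of the per-partition approach, sketched here under some assumptions: instead of re-reading a file on every executor, the closure can carry the schema's JSON text (a plain String, which is serializable) and parse it once per partition. The object name PerPartitionSchema, the helper buildRecords and the one-field "User" schema are hypothetical and exist only for illustration; in real code the JSON could come from schema.toString on the driver.

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

object PerPartitionSchema {
  // Hypothetical schema, for illustration only.
  val schemaJson: String =
    """{"type":"record","name":"User","fields":[{"name":"name","type":"string"}]}"""

  // Meant to be passed to mapPartitions: the schema is parsed once per call
  // (that is, once per partition), and only the JSON String is captured.
  def buildRecords(schemaJson: String)(names: Iterator[String]): Iterator[GenericRecord] = {
    val localSchema = new Schema.Parser().parse(schemaJson)
    names.map { name =>
      val record = new GenericData.Record(localSchema)
      record.put("name", name)
      record
    }
  }
}
```

Assuming an RDD[String] named rdd, this would be used as rdd.mapPartitions(PerPartitionSchema.buildRecords(PerPartitionSchema.schemaJson)). Only the String travels with the task, so the Schema object itself never has to be serialized.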

The code above works as long as you provide a portable reference to the jsonSchema file, since the map function is going to be executed by multiple remote executors. It can be a reference to a file in HDFS or it can be packaged along with the application in the JAR (you will use the class-loader functions to get its contents in the latter case).
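
For the JAR-packaged case, a minimal sketch of loading the schema through the class loader might look like the following. The object name OutputSchema and the resource path /output-schema.avsc are assumptions made for this example, not names from the question.

```scala
import java.io.InputStream
import org.apache.avro.Schema

object OutputSchema {
  // Hypothetical resource path; the .avsc file is assumed to be packaged in the JAR.
  private val ResourcePath = "/output-schema.avsc"

  // Parsed lazily, at most once per JVM. Code inside a closure that refers to
  // OutputSchema.schema accesses the object statically, so the Schema is not
  // captured by the closure and does not need to be serialized.
  lazy val schema: Schema = load(ResourcePath)

  // Reads and parses an Avro schema from the classpath.
  def load(resourcePath: String): Schema = {
    val in: InputStream = getClass.getResourceAsStream(resourcePath)
    require(in != null, s"schema resource not found: $resourcePath")
    try new Schema.Parser().parse(in) finally in.close()
  }
}
```

Inside map or mapPartitions you would then write new GenericData.Record(OutputSchema.schema); each executor JVM parses the file the first time it is needed.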

For those who are trying to use Avro with Spark, note that there are still some unresolved compilation problems, and you have to declare the following dependency in your Maven POM:

<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-mapred</artifactId>
  <version>1.7.7</version>
  <classifier>hadoop2</classifier>
</dependency>

Note the "hadoop2" classifier. You can track the issue at https://issues.apache.org/jira/browse/SPARK-3039.
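
If you build with sbt rather than Maven, the same dependency would presumably be declared as below. This is a sketch assuming a standard build.sbt; the version is simply the one quoted above.

```scala
// build.sbt: same artifact as the Maven snippet, with the "hadoop2" classifier
libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.7.7" classifier "hadoop2"
```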

answered Sep 25 '22 03:09

Nicola Ferraro
