
Skipping fields in a record using spark-avro

Update: the spark-avro package was updated to support this scenario: https://github.com/databricks/spark-avro/releases/tag/v3.1.0

I have an Avro file that was created by a third party outside my control, which I need to process using Spark. The Avro schema is a record where one of the fields is a map whose values are a mixed union type:

{
  "name" : "Properties",
  "type" : {
    "type" : "map",
    "values" : [ "long", "double", "string", "bytes" ]
  }
}

This is unsupported with the spark-avro reader:

In addition to the types listed above, it supports reading of three types of union types: union(int, long) union(float, double) union(something, null), where something is one of the supported Avro types listed above or is one of the supported union types.

Reading about Avro's schema evolution and resolution, I expected to be able to read the file while skipping the problematic field, by specifying a different reader schema that omits it. According to the Avro Schema Resolution docs, this should work:

if the writer's record contains a field with a name not present in the reader's record, the writer's value for that field is ignored.

So I tried reading with a modified reader schema:

 val df = sqlContext.read.option("avroSchema", avroSchema).avro(path)

where avroSchema is the exact same schema the writer used, but without the problematic field.
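For illustration, a trimmed-down reader schema might look like the sketch below. The record name "MyRecord" and the remaining "id" field are placeholders (the real reader schema is simply the writer's schema with the Properties field removed):

// Hypothetical reader schema: the writer's record minus the "Properties" field.
// "MyRecord" and "id" are placeholder names, not the real ones.
import com.databricks.spark.avro._   // adds the .avro(...) method to DataFrameReader

val avroSchema = """
{
  "type" : "record",
  "name" : "MyRecord",
  "fields" : [
    { "name" : "id", "type" : "string" }
  ]
}
"""

val df = sqlContext.read.option("avroSchema", avroSchema).avro(path)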

But I still get the same error regarding mixed union types.

Is this scenario of schema evolution supported in Avro? In spark-avro? Is there another way to achieve my goal?


Update: I have tested the same scenario (the same file, actually) with Apache Avro 1.8.1 and it works as expected, so the problem must be specific to spark-avro. Any ideas?
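A minimal sketch of reading with the plain Avro library and a reader schema, for reference; it assumes the trimmed schema is stored in reader.avsc and the data in data.avro (both placeholder paths):

import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

// The schema passed to the datum reader acts as the reader (expected) schema;
// DataFileReader takes the writer schema from the file header and resolves
// between the two, so the mixed-union field is simply skipped.
val readerSchema = new Schema.Parser().parse(new File("reader.avsc"))
val datumReader = new GenericDatumReader[GenericRecord](readerSchema)
val fileReader = DataFileReader.openReader(new File("data.avro"), datumReader)

while (fileReader.hasNext) {
  val record = fileReader.next()
  // record contains only the fields present in the reader schema
}
fileReader.close()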

asked Nov 03 '16 by itaysk




1 Answer

Update: the spark-avro package was updated to support this scenario: https://github.com/databricks/spark-avro/releases/tag/v3.1.0

The following does not actually answer my question; rather, it is a different solution to the same problem.

Since spark-avro currently does not have this functionality (see my comment on the question), I have instead used Avro's org.apache.avro.mapreduce and Spark's newAPIHadoopFile. Here is a simple example of that:

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.{SparkConf, SparkContext}

val path = "..."
val conf = new SparkConf().setAppName("avro test")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

// Read the file as (AvroKey[GenericRecord], NullWritable) pairs using the
// official Avro MapReduce input format instead of spark-avro.
val avroRdd = sc.newAPIHadoopFile(path,
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable])

Contrary to spark-avro, the official Avro libraries support mixed union types and schema evolution.
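For completeness, here is a sketch of how the trimmed reader schema could be supplied to AvroKeyInputFormat through the Hadoop configuration, so that the mixed-union field is actually dropped during the read. It builds on the snippet above; avroSchemaJson is a placeholder for the trimmed schema string, and AvroJob.setInputKeySchema is the standard avro-mapreduce helper for setting the input key schema:

import org.apache.avro.Schema
import org.apache.avro.mapreduce.AvroJob
import org.apache.hadoop.mapreduce.Job

// The Job object is used only as a carrier for the Hadoop configuration.
// The trimmed reader schema (without the mixed-union field) drives Avro's
// schema resolution inside AvroKeyInputFormat.
val readerSchema = new Schema.Parser().parse(avroSchemaJson) // avroSchemaJson: placeholder JSON string
val job = Job.getInstance()
AvroJob.setInputKeySchema(job, readerSchema)

val avroRdd = sc.newAPIHadoopFile(path,
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable],
  job.getConfiguration)

// Each element is an (AvroKey[GenericRecord], NullWritable) pair. The input
// format reuses record objects, so extract (or copy) what you need before
// caching or collecting.
val jsonRecords = avroRdd.map { case (key, _) => key.datum().toString }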

answered Nov 04 '22 by itaysk