
Read an unsupported mix of union types from an Avro file in Apache Spark

I'm trying to switch from reading CSV flat files to Avro files on Spark. Following https://github.com/databricks/spark-avro I use:

import com.databricks.spark.avro._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.avro("gs://logs.xyz.com/raw/2016/04/20/div1/div2/2016-04-20-08-28-35.UTC.blah-blah.avro")

and get

java.lang.UnsupportedOperationException: This mix of union types is not supported (see README): ArrayBuffer(STRING)

The README states clearly:

This library supports reading all Avro types, with the exception of complex union types. It uses the following mapping from Avro types to Spark SQL types:
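For illustration, here is a hypothetical pair of Avro field declarations (not taken from the actual file) showing the distinction the README draws: spark-avro handles a union of a type with `null` (mapped to a nullable column), but a union mixing several non-null types has no single Spark SQL `DataType` to map to.

```json
{"name": "ok_field",  "type": ["null", "string"]},
{"name": "bad_field", "type": ["string", "long", {"type": "record", "name": "extra", "fields": []}]}
```

The first field becomes a nullable `StringType` column; the second is a "complex union" and triggers the `UnsupportedOperationException` above.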

When I try to read the same file as plain text, I can see the schema:

val df = sc.textFile("gs://logs.xyz.com/raw/2016/04/20/div1/div2/2016-04-20-08-28-35.UTC.blah-blah.avro")
df.take(2).foreach(println)

{"name":"log_record","type":"record","fields":[{"name":"request","type":{"type":"record","name":"request_data","fields":[{"name":"datetime","type":"string"},{"name":"ip","type":"string"},{"name":"host","type":"string"},{"name":"uri","type":"string"},{"name":"request_uri","type":"string"},{"name":"referer","type":"string"},{"name":"useragent","type":"string"}]}}

<------- an excerpt of the full reply ------->

Since I have little control over the format these files arrive in, my question is: is there a workaround someone has tested and can recommend?

I use Google Cloud Dataproc with:

MASTER=yarn-cluster spark-shell --num-executors 4 --executor-memory 4G --executor-cores 4 --packages com.databricks:spark-avro_2.10:2.0.1,com.databricks:spark-csv_2.11:1.3.0

Any help would be greatly appreciated.

Zahiro Mor asked Apr 20 '16 10:04


1 Answer

You won't find a solution that works with Spark SQL. Every column in Spark has to contain values that can be represented as a single DataType, so complex union types simply cannot be represented in a Spark DataFrame.

If you want to read data like this, you should use the RDD API and convert the loaded data to a DataFrame later.
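A minimal sketch of that approach, assuming the fields shown in the question's schema excerpt (`request.datetime`, `request.ip`, `request.host`) and using Avro's Hadoop `AvroInputFormat` to bypass spark-avro's schema conversion entirely:

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable

val path = "gs://logs.xyz.com/raw/2016/04/20/div1/div2/2016-04-20-08-28-35.UTC.blah-blah.avro"

// Load raw Avro records as an RDD; no Spark SQL schema inference happens here,
// so the complex union does not cause an error at load time.
val avroRdd = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable,
                            AvroInputFormat[GenericRecord]](path)

// Manually pull out the fields you need into a flat tuple.
// Field names here come from the schema excerpt in the question.
val rows = avroRdd.map { case (wrapper, _) =>
  val record  = wrapper.datum()
  val request = record.get("request").asInstanceOf[GenericRecord]
  (request.get("datetime").toString,
   request.get("ip").toString,
   request.get("host").toString)
}

// Only now convert to a DataFrame, with a schema you control.
import sqlContext.implicits._
val df = rows.toDF("datetime", "ip", "host")
```

The key point is that you decide per field how to flatten the problematic union (e.g. `toString`, or a pattern match on the runtime type), instead of asking spark-avro to pick a single column type for it.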

53e4cb18 answered Sep 19 '22 00:09