 

Reading Avro File in Spark

I have read an Avro file into a Spark RDD and need to convert it into a SQL DataFrame. How do I do that?

This is what I did so far.

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable

val path = "hdfs://dds-nameservice/user/ghagh/"
val avroRDD = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable, AvroInputFormat[GenericRecord]](path)

When I do:

avroRDD.take(1)

I get back

res1: Array[(org.apache.avro.mapred.AvroWrapper[org.apache.avro.generic.GenericRecord], org.apache.hadoop.io.NullWritable)] = Array(({"column1": "value1", "column2": "value2", "column3": value3,...

How do I convert this to a SparkSQL dataframe?

I am using Spark 1.6

Can anyone tell me if there is an easy solution around this?

Gayatri asked Jul 27 '17 20:07


People also ask

How do I view Avro files?

The Avro file needs to be converted into a file type that Boomi is able to read and write; in this example, JSON is used as that file type. The scripts below have been run successfully on local Atoms but have not been tested on cloud Atoms. You will also need to install the Apache Avro jar files.

Does spark support Avro?

Since the Spark 2.4 release, Spark SQL provides built-in support for reading and writing Apache Avro data.
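
On Spark 2.4+ the Avro data source ships as an official external module, pulled in with something like `--packages org.apache.spark:spark-avro_2.11:2.4.0` when launching the shell. A minimal sketch (the path is a placeholder, not from the question):

    val df = spark.read
      .format("avro")                 // built-in short name since Spark 2.4
      .load("path/to/file.avro")
    df.printSchema()
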



1 Answer

For DataFrame I'd go with Avro data source directly:

  • Include spark-avro in packages list. For the latest version use:

    com.databricks:spark-avro_2.11:3.2.0
    
  • Load the file:

    val df = spark.read
      .format("com.databricks.spark.avro")
      .load(path)
    
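Since the question targets Spark 1.6, which has no `SparkSession` (`spark`), the same approach would go through `sqlContext` instead, with the 1.x-compatible spark-avro release. A hedged sketch, assuming `com.databricks:spark-avro_2.10:2.0.1` is the right artifact for your Scala/Spark build and using the HDFS path from the question:

    // launch with: spark-shell --packages com.databricks:spark-avro_2.10:2.0.1
    val df = sqlContext.read
      .format("com.databricks.spark.avro")
      .load("hdfs://dds-nameservice/user/ghagh/")

    // expose it to Spark SQL (Spark 1.6 API)
    df.registerTempTable("avro_table")
    sqlContext.sql("SELECT * FROM avro_table LIMIT 1").show()
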
Alper t. Turker answered Oct 09 '22 19:10