I have read an Avro file into a Spark RDD and need to convert it into a SQL DataFrame. How do I do that?
This is what I have done so far:
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable
val path = "hdfs://dds-nameservice/user/ghagh/"
// Read the Avro files with the old Hadoop (mapred) input format; each element is (AvroWrapper[GenericRecord], NullWritable)
val avroRDD = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable, AvroInputFormat[GenericRecord]](path)
When I do:
avroRDD.take(1)
I get back
res1: Array[(org.apache.avro.mapred.AvroWrapper[org.apache.avro.generic.GenericRecord], org.apache.hadoop.io.NullWritable)] = Array(({"column1": "value1", "column2": "value2", "column3": value3,...
How do I convert this to a Spark SQL DataFrame?
I am using Spark 1.6.
Can anyone tell me if there is an easy solution for this?
The Avro file needs to be converted into a file type that Boomi is able to read and write; in this example we use JSON as that file type. The scripts below have been run successfully on local atoms but have not been tested on cloud atoms. You will also need to install the Apache Avro jar files.
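As a minimal Scala sketch of that Avro-to-JSON conversion with the Avro jars on the classpath (the local file paths here are hypothetical): GenericRecord.toString renders each record as JSON, so the output file ends up with one JSON object per line.

import java.io.File
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

// Open the Avro container file with a generic (schema-embedded) reader
val reader = new DataFileReader[GenericRecord](
  new File("/tmp/input.avro"),                 // hypothetical input path
  new GenericDatumReader[GenericRecord]())

// Write each record's JSON rendering as one line of the output file
val out = new java.io.PrintWriter(new File("/tmp/output.json"))  // hypothetical output path
try {
  while (reader.hasNext) {
    out.println(reader.next().toString)
  }
} finally {
  out.close()
  reader.close()
}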
Since the Spark 2.4 release, Spark SQL provides built-in support for reading and writing Apache Avro data.
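For example, on Spark 2.4+ the spark-avro module just has to be on the classpath (e.g. launched with --packages org.apache.spark:spark-avro_2.11:2.4.0); the path below is the one from the question:

// Spark 2.4+: the built-in Avro data source is selected with format("avro")
val df = spark.read
  .format("avro")
  .load("hdfs://dds-nameservice/user/ghagh/")

// Register the result so it can be queried with Spark SQL
df.createOrReplaceTempView("avro_table")
spark.sql("SELECT * FROM avro_table LIMIT 10").show()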
For DataFrames I'd go with the Avro data source directly:
Include spark-avro in the packages list. For the latest version use:
com.databricks:spark-avro_2.11:3.2.0
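For example, the package can be pulled in when launching the shell (coordinates as given above):

spark-shell --packages com.databricks:spark-avro_2.11:3.2.0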
Load the file:
val df = spark.read
.format("com.databricks.spark.avro")
.load(path)
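Since the question mentions Spark 1.6, here is a sketch of the same load through sqlContext (there is no SparkSession in 1.x, and you would need a Spark 1.x-compatible spark-avro release, e.g. com.databricks:spark-avro_2.10:2.0.1); the path is the one from the question:

// Spark 1.6: go through sqlContext instead of the Spark 2.x SparkSession
val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("hdfs://dds-nameservice/user/ghagh/")

// Register the DataFrame so it can be queried with Spark SQL
df.registerTempTable("avro_table")
sqlContext.sql("SELECT * FROM avro_table LIMIT 10").show()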