I have read an Avro file into a Spark RDD and need to convert it into a SQL DataFrame. How do I do that?
This is what I have done so far:
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable
val path = "hdfs://dds-nameservice/user/ghagh/"
// Read the Avro files with the old Hadoop (mapred) input format; each element is (AvroWrapper[GenericRecord], NullWritable)
val avroRDD = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable, AvroInputFormat[GenericRecord]](path)
When I do:
avroRDD.take(1)
I get back
res1: Array[(org.apache.avro.mapred.AvroWrapper[org.apache.avro.generic.GenericRecord], org.apache.hadoop.io.NullWritable)] = Array(({"column1": "value1", "column2": "value2", "column3": value3,...
How do I convert this to a Spark SQL DataFrame?
I am using Spark 1.6.
Can anyone tell me if there is an easy solution for this?
The Avro file needs to be converted into a file type that Boomi is able to read and write; in this example we use JSON as that file type. The scripts below have been run successfully on local atoms but have not been tested on cloud atoms. You will also need to install the Apache Avro jar files.
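As a minimal Scala sketch of that Avro-to-JSON conversion with the Avro jars on the classpath (the local file paths here are hypothetical): GenericRecord.toString renders each record as JSON, so the output file ends up with one JSON object per line.

import java.io.File
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

// Open the Avro container file with a generic (schema-embedded) reader
val reader = new DataFileReader[GenericRecord](
  new File("/tmp/input.avro"),                 // hypothetical input path
  new GenericDatumReader[GenericRecord]())

// Write each record's JSON rendering as one line of the output file
val out = new java.io.PrintWriter(new File("/tmp/output.json"))  // hypothetical output path
try {
  while (reader.hasNext) {
    out.println(reader.next().toString)
  }
} finally {
  out.close()
  reader.close()
}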
Since the Spark 2.4 release, Spark SQL provides built-in support for reading and writing Apache Avro data.
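For example, on Spark 2.4+ the spark-avro module just has to be on the classpath (e.g. launched with --packages org.apache.spark:spark-avro_2.11:2.4.0); the path below is the one from the question:

// Spark 2.4+: the built-in Avro data source is selected with format("avro")
val df = spark.read
  .format("avro")
  .load("hdfs://dds-nameservice/user/ghagh/")

// Register the result so it can be queried with Spark SQL
df.createOrReplaceTempView("avro_table")
spark.sql("SELECT * FROM avro_table LIMIT 10").show()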
For DataFrames I'd go with the Avro data source directly:
Include spark-avro in the packages list. For the latest version use:
com.databricks:spark-avro_2.11:3.2.0
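For example, the package can be pulled in when launching the shell (coordinates as given above):

spark-shell --packages com.databricks:spark-avro_2.11:3.2.0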
Load the file:
val df = spark.read
.format("com.databricks.spark.avro")
.load(path)
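Since the question mentions Spark 1.6, here is a sketch of the same load through sqlContext (there is no SparkSession in 1.x, and you would need a Spark 1.x-compatible spark-avro release, e.g. com.databricks:spark-avro_2.10:2.0.1); the path is the one from the question:

// Spark 1.6: go through sqlContext instead of the Spark 2.x SparkSession
val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("hdfs://dds-nameservice/user/ghagh/")

// Register the DataFrame so it can be queried with Spark SQL
df.registerTempTable("avro_table")
sqlContext.sql("SELECT * FROM avro_table LIMIT 10").show()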