
Spark - write Avro file

What are the common practices for writing Avro files with Spark (using the Scala API) in a flow like this:

  1. parse some log files from HDFS
  2. for each log file, apply some business logic and generate an Avro file (or maybe merge multiple files)
  3. write the Avro files to HDFS

I tried to use spark-avro, but it doesn't help much.

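import com.databricks.spark.avro._   // adds .avro(...) to DataFrameWriter
import org.apache.spark.sql.SQLContext

// Read the raw log lines from HDFS
val someLogs = sc.textFile(inputPath)

// Turn each line into a Row matching `schema`
val rowRDD = someLogs.map { line =>
  createRow(...)
}

val sqlContext = new SQLContext(sc)
val dataFrame = sqlContext.createDataFrame(rowRDD, schema)
dataFrame.write.avro(outputPath)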

This fails with the following error:

org.apache.spark.sql.AnalysisException: 
      Reference 'StringField' is ambiguous, could be: StringField#0, StringField#1, StringField#2, StringField#3, ...
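
One plausible reading of this error, offered as an assumption because the schema itself is not shown, is that the schema contains several fields all named StringField, so the name cannot be resolved to a single column. A minimal sketch of a check for repeated field names follows; `duplicateNames` and the sample names are hypothetical, and with a real Spark schema the input would come from `schema.fieldNames`.

```scala
// Returns every field name that occurs more than once, in sorted order.
def duplicateNames(fieldNames: Seq[String]): Seq[String] =
  fieldNames
    .groupBy(identity)
    .collect { case (name, occurrences) if occurrences.size > 1 => name }
    .toSeq
    .sorted

// Hypothetical field names standing in for schema.fieldNames.
val names = Seq("StringField", "IntField", "StringField", "StringField")
val duplicates = duplicateNames(names)
```

If the check reports duplicates, giving each field a unique name before calling createDataFrame is one way to rule this cause out.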
iuliandumitru asked Nov 23 '15 18:11




2 Answers

Databricks provides the spark-avro library, which helps with reading and writing Avro data.

dataFrame.write.format("com.databricks.spark.avro").save(outputPath)
Sudheer Palyam answered Oct 06 '22 01:10


With Spark 2 and Scala 2.11:

import com.databricks.spark.avro._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").getOrCreate()

// Run your transformations and keep the result in a DataFrame, here called dataFrame

dataFrame.write.avro("/tmp/output")

Maven dependency

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>4.0.0</version> 
</dependency>
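
Since Spark 2.4, the Avro data source ships as the separate spark-avro module under the org.apache.spark group and is selected with format("avro") instead of the Databricks package. The sketch below walks through the flow from the question (read logs, parse, write Avro) under that setup. The `LogEntry` type, the `parseLine` helper, and the "LEVEL message" line layout are hypothetical stand-ins for the real business logic, and the module is assumed to be on the classpath.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Hypothetical record type; the real fields depend on the log format.
case class LogEntry(level: String, message: String)

object LogsToAvro {
  // Hypothetical parser: assumes lines look like "LEVEL message text".
  def parseLine(line: String): Option[LogEntry] =
    line.split(" ", 2) match {
      case Array(level, message) => Some(LogEntry(level, message))
      case _                     => None // skip lines that do not match
    }

  def main(args: Array[String]): Unit = {
    val Array(inputPath, outputPath) = args

    val spark = SparkSession.builder().appName("logs-to-avro").getOrCreate()
    import spark.implicits._

    // 1. read the log files, 2. apply the parsing logic, 3. write Avro
    val entries = spark.read.textFile(inputPath).flatMap(parseLine(_).toList)
    entries.write.mode(SaveMode.Overwrite).format("avro").save(outputPath)

    spark.stop()
  }
}
```

Using a case class lets Spark derive the schema, so the field names are unique by construction.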
Debaditya answered Oct 06 '22 01:10