What are the common practices for writing Avro files with Spark (using the Scala API)? I tried to use spark-avro, but it doesn't help much. My flow looks like this:
import com.databricks.spark.avro._
import org.apache.spark.sql.{Row, SQLContext}

val someLogs = sc.textFile(inputPath)
val rowRDD = someLogs.map { line =>
  createRow(...) // build a Row from each log line
}
val sqlContext = new SQLContext(sc)
val dataFrame = sqlContext.createDataFrame(rowRDD, schema)
dataFrame.write.avro(outputPath)
This fails with the following error:
org.apache.spark.sql.AnalysisException:
Reference 'StringField' is ambiguous, could be: StringField#0, StringField#1, StringField#2, StringField#3, ...
Since the spark-avro module is external, there is no .avro API in DataFrameReader or DataFrameWriter. To load or save data in Avro format, you need to specify the data source option format as avro (or org.apache.spark.sql.avro).
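In Spark 2.4+ that looks something like the sketch below (a minimal sketch, assuming the external org.apache.spark:spark-avro artifact is on the classpath; inputPath and outputPath are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").getOrCreate()

// Read Avro files using the short data source name...
val df = spark.read.format("avro").load(inputPath)

// ...and write them back the same way; the fully qualified
// name format("org.apache.spark.sql.avro") also works
df.write.format("avro").save(outputPath)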
Apache Avro is an open-source, row-based data serialization and data exchange framework for Hadoop and big data projects. It is widely used with Apache Spark, especially in Kafka-based data pipelines.
Databricks provides the spark-avro library, which helps with reading and writing Avro data.
dataFrame.write.format("com.databricks.spark.avro").save(outputPath)
Spark 2 and Scala 2.11
import com.databricks.spark.avro._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").getOrCreate()

// Build your DataFrame (here called dataFrame) with whatever operations you need.
// The implicits from com.databricks.spark.avro add an .avro method to DataFrameWriter.
dataFrame.write.avro("/tmp/output")
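The same import also adds an .avro method to DataFrameReader, so reading the files back is symmetric (a sketch; the path is a placeholder):

// Read the Avro output back into a DataFrame
val readBack = spark.read.avro("/tmp/output")
readBack.show()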
Maven dependency:

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>4.0.0</version>
</dependency>
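If you build with sbt instead of Maven, the equivalent dependency (for Scala 2.11, as above) would be:

// build.sbt: the same artifact as the Maven dependency above
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"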