What are the common practices for writing Avro files with Spark (using the Scala API)? I tried to use spark-avro, but it doesn't help much. My flow looks like this:
import com.databricks.spark.avro._
import org.apache.spark.sql.{Row, SQLContext}

val someLogs = sc.textFile(inputPath)
val rowRDD = someLogs.map { line =>
  createRow(...) // build a Row from each log line
}
val sqlContext = new SQLContext(sc)
val dataFrame = sqlContext.createDataFrame(rowRDD, schema)
dataFrame.write.avro(outputPath)
This fails with the following error:
org.apache.spark.sql.AnalysisException:
Reference 'StringField' is ambiguous, could be: StringField#0, StringField#1, StringField#2, StringField#3, ...
Since the spark-avro module is external, there is no .avro API in DataFrameReader or DataFrameWriter. To load or save data in Avro format, you need to specify the data source option format as avro (or org.apache.spark.sql.avro).
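In Spark 2.4+ that looks something like the sketch below (a minimal sketch, assuming the external org.apache.spark:spark-avro artifact is on the classpath; inputPath and outputPath are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").getOrCreate()

// Read Avro files using the short data source name...
val df = spark.read.format("avro").load(inputPath)

// ...and write them back the same way; the fully qualified
// name format("org.apache.spark.sql.avro") also works
df.write.format("avro").save(outputPath)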
Apache Avro is an open-source, row-based data serialization and data exchange framework for Hadoop and big data projects. It is widely used with Apache Spark, especially in Kafka-based data pipelines.
Databricks provides the spark-avro library, which helps with reading and writing Avro data.
dataFrame.write.format("com.databricks.spark.avro").save(outputPath)
Spark 2 and Scala 2.11
import com.databricks.spark.avro._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").getOrCreate()

// Build your DataFrame (here called dataFrame) with whatever operations you need.
// The implicits from com.databricks.spark.avro add an .avro method to DataFrameWriter.
dataFrame.write.avro("/tmp/output")
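The same import also adds an .avro method to DataFrameReader, so reading the files back is symmetric (a sketch; the path is a placeholder):

// Read the Avro output back into a DataFrame
val readBack = spark.read.avro("/tmp/output")
readBack.show()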
Maven dependency:

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>4.0.0</version>
</dependency>
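If you build with sbt instead of Maven, the equivalent dependency (for Scala 2.11, as above) would be:

// build.sbt: the same artifact as the Maven dependency above
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"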