I want to write my collection to a .parquet file so that it can later be read using Spark.
So far I am creating the file with this code:
package com.contrib.parquet

import org.apache.avro.SchemaBuilder
import org.apache.avro.reflect.ReflectData
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.{ParquetFileWriter, ParquetWriter}
import org.apache.parquet.hadoop.metadata.CompressionCodecName

object ParquetWriter {

  def main(args: Array[String]): Unit = {
    // Avro schema describing the records to write.
    val schema = SchemaBuilder
      .record("Record")
      .fields()
      .requiredString("name")
      .requiredInt("id")
      .endRecord()

    // Parquet writer backed by the reflection-based Avro data model,
    // so it can write case class instances directly.
    val writer: ParquetWriter[Record] = AvroParquetWriter
      .builder[Record](new Path("/tmp/parquetResult"))
      .withConf(new Configuration)
      .withDataModel(ReflectData.get)
      .withCompressionCodec(CompressionCodecName.SNAPPY)
      .withSchema(schema)
      .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
      .build()

    Seq(Record("nameOne", 1), Record("nameTwo", 2)).foreach(writer.write)
    writer.close()
  }

  case class Record(name: String, id: Int)
}
This creates a Parquet file successfully.
When I try to read that file using Spark, I get a java.lang.NoSuchMethodError: org.apache.parquet.column.values.ValuesReader.initFromPage error.
Spark code:
import org.apache.spark.sql.{Encoders, SparkSession}

val master = "local[4]"
val sparkCtx = SparkSession
  .builder()
  .appName("ParquetReader")
  .master(master)
  .getOrCreate()

// Record is the same case class as in the writer.
val schema = Encoders.product[Record].schema

val df = sparkCtx.read.parquet("/tmp/parquetResult")
df.show(100, false)
How do I write Parquet files so that they can be read using Spark? I don't want to have a local Spark app just to write this file.
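A java.lang.NoSuchMethodError on ValuesReader.initFromPage typically means two different Parquet versions end up on the same classpath: the parquet-avro dependency used for writing pulls in a parquet-column whose initFromPage signature differs from the one Spark was compiled against. One way out is to pin the writer's Parquet dependency to the version your Spark release bundles. A minimal build.sbt sketch, assuming sbt; the version number below is a placeholder assumption to be replaced with the parquet-column version found in your Spark distribution's jars directory:

// build.sbt (sketch): keep the writer's Parquet artifacts at the same version
// as the Parquet jars bundled with the Spark release that will read the files.
// "1.8.2" is an assumption; replace it with the parquet-column version
// shipped in your Spark distribution's jars/ directory.
val parquetVersion = "1.8.2"

libraryDependencies ++= Seq(
  "org.apache.parquet" % "parquet-avro" % parquetVersion
)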
We ended up using the open source library parquet4s: https://github.com/mjakubowski84/parquet4s
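For reference, a minimal writing sketch with parquet4s, assuming its 1.x API in which ParquetWriter.writeAndClose takes a target path and an iterable of case class instances (newer releases moved to a builder-style API, so check the project README for the version you pick):

import com.github.mjakubowski84.parquet4s.ParquetWriter

object Parquet4sWriter {

  case class Record(name: String, id: Int)

  def main(args: Array[String]): Unit = {
    // The Parquet schema is derived from the case class itself,
    // so no Avro schema, Hadoop Path or Configuration is needed.
    val records = Seq(Record("nameOne", 1), Record("nameTwo", 2))
    ParquetWriter.writeAndClose("/tmp/parquetResult", records)
  }
}

Spark then reads the resulting file with the same sparkCtx.read.parquet("/tmp/parquetResult") call as above.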