
Reading an Avro file from scala

Tags:

scala

avro

I'm trying to read an Avro file using Scala.

I've extracted the file's schema using avro-tools and saved it to a file. I then try to read the Avro file using the following code:

 val zibi= scala.io.Source.fromFile("/home/wasabi/schema").mkString
 val schema_obj =  new Schema.Parser
 val schema2 = schema_obj.parse(zibi)
 val READER2 = new GenericDatumReader[GenericRecord](schema2)
 val myFile = Files.readAllBytes(Paths.get("/tmp/check/CMRF_80_1442744555901-1_1_2_1_1_1_4_10_1.avro"))

 val datum = READER2.read(null, DecoderFactory.defaultFactory.createBinaryDecoder(myFile,null))

But I keep hitting an IOException like this:

java.io.IOException: Invalid int encoding
        at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:145)
        at org.apache.avro.io.ValidatingDecoder.readInt(ValidatingDecoder.java:83)
        at org.apache.avro.generic.GenericDatumReader.readInt(GenericDatumReader.java:444)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:159)
        at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
        at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
        at org.apache.avro.generic.GenericDatumReader.readArray(GenericDatumReader.java:219)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
        at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
        at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)

When I read the file with avro-tools, it reads just fine.

What am I doing wrong?

Daniel Haviv asked Feb 09 '23 22:02


1 Answer

Try using a DataFileReader instead of using a BinaryDecoder.

Encoders and Decoders are meant for writing and reading raw Avro-encoded data, so I suspect the BinaryDecoder is choking on the header information found in Avro data files.
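
One way to check this suspicion is to look at the first bytes of the file. Avro object container files start with the four bytes 'O', 'b', 'j', 1, followed by file metadata. The sketch below is only an illustration: the object name AvroMagicCheck and the method looksLikeContainerFile are made up for this example and are not part of the Avro API.

```scala
import java.io.{DataInputStream, EOFException}
import java.nio.file.{Files, Path}

// Hypothetical helper, not part of Avro: checks for the container-file magic bytes.
object AvroMagicCheck {
  // Avro object container files begin with 'O', 'b', 'j', 1.
  val Magic: Array[Byte] = Array[Byte]('O'.toByte, 'b'.toByte, 'j'.toByte, 1)

  def looksLikeContainerFile(path: Path): Boolean = {
    val in = new DataInputStream(Files.newInputStream(path))
    try {
      val head = new Array[Byte](Magic.length)
      in.readFully(head) // throws EOFException if the file is shorter than 4 bytes
      java.util.Arrays.equals(head, Magic)
    } catch {
      case _: EOFException => false
    } finally {
      in.close()
    }
  }
}
```

If this returns true for the .avro file in question, the byte array handed to createBinaryDecoder starts with a file header rather than with a record, which would be consistent with the "Invalid int encoding" error.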

import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{ GenericDatumReader, GenericRecord }

// Parse the schema that was extracted with avro-tools
val zibi = scala.io.Source.fromFile("/home/wasabi/schema").mkString
val schema_obj = new Schema.Parser
val schema2 = schema_obj.parse(zibi)
val READER2 = new GenericDatumReader[GenericRecord](schema2)

// DataFileReader consumes the data file's header before decoding records
val myFile = new File("/tmp/check/CMRF_80_1442744555901-1_1_2_1_1_1_4_10_1.avro")
val dataFileReader = new DataFileReader[GenericRecord](myFile, READER2)
val datum = dataFileReader.next() // first record in the file
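
The snippet above reads only the first record. As a rough sketch of reading every record, the variant below iterates with hasNext/next and closes the reader afterwards. It also relies on the writer schema stored in the data file's header by constructing GenericDatumReader without a schema, so the separately extracted schema file is not needed here. The names ReadAllRecords and readAll are made up for this illustration.

```scala
import java.io.File
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

// Hypothetical helper for illustration; requires the Avro library on the classpath.
object ReadAllRecords {
  def readAll(file: File): Vector[GenericRecord] = {
    // No schema is passed in: the reader picks up the writer schema from the file header.
    val datumReader = new GenericDatumReader[GenericRecord]()
    val fileReader = new DataFileReader[GenericRecord](file, datumReader)
    try {
      val records = Vector.newBuilder[GenericRecord]
      while (fileReader.hasNext) {
        records += fileReader.next()
      }
      records.result()
    } finally {
      fileReader.close() // releases the underlying file handle
    }
  }
}
```

A call such as ReadAllRecords.readAll(new File("/tmp/check/CMRF_80_1442744555901-1_1_2_1_1_1_4_10_1.avro")) would load all records into memory, so for large files it may be better to process each record inside the loop instead.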

Julian Peeters answered Feb 16 '23 03:02