
Read AVRO file using Python

I have an AVRO file (created by Java) that seems to be some kind of compressed file for Hadoop/MapReduce. I want to 'unzip' (deserialize) it to a flat file, one record per row.

I learned that there is an AVRO package for Python, and I installed it correctly. I then ran the example to read the AVRO file, but it failed with the error below, and I am wondering what goes wrong when reading even the simplest example. Can anyone help me interpret the error below?

>>> reader = DataFileReader(open("/tmp/Stock_20130812104524.avro", "r"), DatumReader())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../python2.7/site-packages/avro/datafile.py", line 240, in __init__
    raise DataFileException('Unknown codec: %s.' % self.codec)
avro.datafile.DataFileException: Unknown codec: snappy.

By the way, if I run 'head' on the file or open the first few lines of the AVRO file in vi, I can see the schema definition together with some unreadable characters, probably the compressed content. The start of the raw AVRO file looks like this:


I don't know whether the schema is needed to read the AVRO file, with something like the following:

schema = avro.schema.parse(open("schema").read())
# pass the schema to the reader somehow...
reader = DataFileReader(open("Stock_20130812104524.avro", "rb"), DatumReader())

Thanks in advance.

B.Mr.W. asked Mar 23 '23 15:03


1 Answer

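The error means the file's data blocks are compressed with the snappy codec, which the avro package can only decode when the Python snappy module is importable. Try pip install python-snappy, and make sure the snappy library itself is installed first.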
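Once the codec is available, the "one record per row" dump asked about in the question could look roughly like the sketch below. It assumes the avro package exposes DataFileReader and DatumReader as in the question, and that writing each record as a JSON line is an acceptable flat format. The function names avro_to_flat_file and records_to_lines, and the output path in the usage note, are made up for illustration. No separate schema file should be needed, since the writer's schema is stored in the file header.

```python
import json


def records_to_lines(records):
    """Yield one JSON-encoded line per record (each record is a dict)."""
    for record in records:
        # default=str is a fallback for values json cannot encode, e.g. bytes.
        yield json.dumps(record, sort_keys=True, default=str)


def avro_to_flat_file(avro_path, out_path):
    """Deserialize an Avro container file into a text file, one record per row."""
    # Imported here so records_to_lines stays usable without avro installed.
    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    # Avro container files are binary, so open the input in "rb" mode.
    with open(avro_path, "rb") as src, open(out_path, "w") as dst:
        reader = DataFileReader(src, DatumReader())
        try:
            for line in records_to_lines(reader):
                dst.write(line + "\n")
        finally:
            reader.close()
```

It would be called as, for example, avro_to_flat_file("/tmp/Stock_20130812104524.avro", "/tmp/stock.jsonl"). If "Unknown codec: snappy" still appears, the snappy module is most likely not importable from the same Python environment that runs the script.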

chlunde answered Apr 01 '23 00:04
