
How to read Avro files in Python 3.5.2

I am trying to read Avro files using Python.

I installed Apache Avro following the instructions below. I think the installation succeeded, because I am able to "import avro" in the Python shell.

https://avro.apache.org/docs/1.8.1/gettingstartedpython.html

However, when I try to read Avro files using the code from those instructions, I keep receiving errors when importing the Avro modules.

>>> import avro.schema
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
import avro.schema
File "<frozen importlib._bootstrap>", line 969, in _find_and_load
File "<frozen importlib._bootstrap>", line 954, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 896, in _find_spec
File "<frozen importlib._bootstrap_external>", line 1139, in find_spec
File "<frozen importlib._bootstrap_external>", line 1115, in _get_spec
File "<frozen importlib._bootstrap_external>", line 1096, in _legacy_get_spec
File "<frozen importlib._bootstrap>", line 444, in spec_from_loader
File "<frozen importlib._bootstrap_external>", line 533, in spec_from_file_location
File "I:\Program Files\lib\site-packages\avro-_avro_version_-py3.5.egg\avro\schema.py", line 340
except Exception, e:
                ^
SyntaxError: invalid syntax


>>> from avro.datafile import DataFileReader, DataFileWriter
Traceback (most recent call last):
File "I:\Program Files\lib\site-packages\avro-_avro_version_-py3.5.egg\avro\datafile.py", line 21, in <module>
from cStringIO import StringIO
ImportError: No module named 'cStringIO'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
from avro.datafile import DataFileReader, DataFileWriter
File "I:\Program Files\lib\site-packages\avro-_avro_version_-py3.5.egg\avro\datafile.py", line 23, in <module>
from StringIO import StringIO
ImportError: No module named 'StringIO'


>>> from avro.io import DatumReader, DatumWriter
Traceback (most recent call last):
File "<pyshell#19>", line 1, in <module>
from avro.io import DatumReader, DatumWriter
File "<frozen importlib._bootstrap>", line 969, in _find_and_load
File "<frozen importlib._bootstrap>", line 954, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 896, in _find_spec
File "<frozen importlib._bootstrap_external>", line 1139, in find_spec
File "<frozen importlib._bootstrap_external>", line 1115, in _get_spec
File "<frozen importlib._bootstrap_external>", line 1096, in _legacy_get_spec
File "<frozen importlib._bootstrap>", line 444, in spec_from_loader
File "<frozen importlib._bootstrap_external>", line 533, in spec_from_file_location
File "I:\Program Files\lib\site-packages\avro-_avro_version_-py3.5.egg\avro\io.py", line 200
bits = (((ord(self.read(1)) & 0xffL)) |
                                  ^
SyntaxError: invalid syntax

So did I install Avro successfully? Why am I receiving these errors? I am using Python 3.5.2 on Windows 7.

Edit: I fixed the issue by following the suggestion from Stephane Martin. Then I tried to read Avro files into Python. I have a bunch of Avro files in a directory that I have already set as the path in Python. Here is my code:

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

reader = DataFileReader(open("part-00000-of-01733.avro", "r"), DatumReader())
for user in reader:
   print (user)
reader.close()

It returns this error:

Traceback (most recent call last):
File "I:\DJ data\read avro.py", line 5, in <module>
reader = DataFileReader(open("part-00000-of-01733.avro", "r"), DatumReader())
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 349, in __init__
self._read_header()
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 459, in _read_header
META_SCHEMA, META_SCHEMA, self.raw_decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 525, in read_data
return self.read_record(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 725, in read_record
field_val = self.read_data(field.type, readers_field.type, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 515, in read_data
return self.read_fixed(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 568, in read_fixed
return decoder.read(writer_schema.size)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 170, in read
input_bytes = self.reader.read(n)
File "I:\Program Files\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 863: character maps to <undefined>
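
The `cp1252.py` frame in the traceback above suggests the file was opened in text mode (`"r"`), so Python tried to decode Avro's binary content as Windows-1252 text. Opening the file with `"rb"` is the likely fix. The sketch below uses only the standard library to show the difference on a small stand-in file. The file name `sample.avro` and both helper functions are hypothetical, and the payload is not a real Avro file: it is just the 4-byte container magic `Obj\x01` followed by the byte 0x90 from the error message.

```python
import os
import tempfile

AVRO_MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def starts_with_avro_magic(path):
    """Return True if the file begins with the Avro container magic bytes."""
    with open(path, "rb") as f:  # binary mode: no text decoding happens
        return f.read(len(AVRO_MAGIC)) == AVRO_MAGIC


def text_mode_read_fails(path, encoding="cp1252"):
    """Return True if reading the file as text raises UnicodeDecodeError."""
    try:
        with open(path, "r", encoding=encoding) as f:
            f.read()
        return False
    except UnicodeDecodeError:
        return True


# Build a tiny stand-in file (NOT a valid Avro file, illustration only).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.avro")
with open(sample_path, "wb") as f:
    f.write(AVRO_MAGIC + b"\x90\x00\x01")

print(starts_with_avro_magic(sample_path))  # binary read succeeds
print(text_mode_read_fails(sample_path))    # cp1252 has no mapping for 0x90
```

With the avro-python3 package from the question, the corresponding change would be `open("part-00000-of-01733.avro", "rb")` in the `DataFileReader` call.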

I am aware that the example in the instructions creates a schema first. But what is an avsc file? How should I create it, and the corresponding schema, in my case?
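
For context, an `.avsc` file is a plain JSON document that describes an Avro schema. Avro data files (object container files) store the writer's schema in their header, so reading an existing file with `DataFileReader(..., DatumReader())` should not need a separate `.avsc`; one matters mainly when writing new files. The sketch below parses an illustrative schema with the standard `json` module. The record name `User` and its fields are made up for the example and are not taken from the files in the question.

```python
import json

# Illustrative schema text, as it might appear in a file such as user.avsc.
SCHEMA_JSON = """
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]}
  ]
}
"""

# An .avsc file is ordinary JSON, so the json module can load it.
schema = json.loads(SCHEMA_JSON)
field_names = [field["name"] for field in schema["fields"]]

print(schema["type"], schema["name"], field_names)
```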

Tracy Yang asked Nov 22 '16 01:11



2 Answers

With recent versions of the avro package, this should no longer be an issue.


Original answer:

When installing through pip or a similar package manager: install the avro-python3 package instead of just avro.
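
The tracebacks in the question point at Python 2-only code inside the installed `avro` egg: a long-integer literal (`0xffL`) and imports of `cStringIO` and `StringIO`. A minimal standard-library sketch, meant to be run on Python 3, shows that neither is available there. The helper names `compiles` and `module_available` are hypothetical.

```python
import importlib.util


def compiles(source):
    """Return True if this interpreter can compile the given source text."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


def module_available(name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


# Python 2 long-integer literal, as seen in avro/io.py in the traceback.
print("0xffL compiles:", compiles("bits = 0xffL"))
# The same value written as a plain int literal.
print("0xff compiles:", compiles("bits = 0xff"))

# Python 2 modules imported by avro/datafile.py in the traceback.
print("cStringIO found:", module_available("cStringIO"))
print("StringIO found:", module_available("StringIO"))
# The io module provides StringIO and BytesIO on Python 3.
print("io found:", module_available("io"))
```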

Thomas answered Nov 06 '22 05:11


Use the Avro distribution for Python 3, not the one for Python 2.

http://apache.mediamirrors.org/avro/stable/py3/

Stephane Martin answered Nov 06 '22 05:11