How to read Avro file in PySpark

Tags: python, apache-spark, pyspark, avro

I am writing a Spark job in Python, and I need to read in a large number of Avro files.

The closest solution I have found is in Spark's examples folder. However, it requires submitting the Python script with spark-submit. On the spark-submit command line you can specify the driver classpath (--driver-class-path) so that the AvroKey and AvroValue classes can be located.

avro_rdd = sc.newAPIHadoopFile(
        path,
        "org.apache.avro.mapreduce.AvroKeyInputFormat",
        "org.apache.avro.mapred.AvroKey",
        "org.apache.hadoop.io.NullWritable",
        keyConverter="org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter",
        conf=conf)

In my case, I need to run everything from within the Python script. I tried setting an environment variable that points to the jar file, hoping Python would add the jar to the classpath, but it clearly does not: I get an unexpected class error.

os.environ['SPARK_SUBMIT_CLASSPATH'] = "/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0.jar"

Can anyone help me read Avro files from within a single Python script?

asked Apr 20 '15 22:04

B.Mr.W.

2 Answers

Spark >= 2.4.0

You can use the built-in Avro support. The API is backwards compatible with the spark-avro package, with a few additions (most notably the from_avro / to_avro functions).

Please note that the module is not bundled with the standard Spark binaries and has to be included using spark.jars.packages or an equivalent mechanism.

See also Pyspark 2.4.0, read avro from kafka with read stream - Python
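
As a rough sketch of how this could look from a single Python script: the helper names avro_package and read_avro below are made up for illustration, and the Spark and Scala versions passed in are assumptions that have to match your own installation. The package coordinate follows the org.apache.spark:spark-avro_<scala>:<spark> pattern, and spark.jars.packages only takes effect if it is set before the session (and its JVM) is first created.

```python
def avro_package(spark_version, scala_version):
    """Build the Maven coordinate of the spark-avro module.

    Both versions are assumptions supplied by the caller; they must match
    the Spark build that is actually installed.
    """
    return "org.apache.spark:spark-avro_{}:{}".format(scala_version, spark_version)


def read_avro(path, spark_version, scala_version):
    """Hypothetical helper: start a session with spark-avro and load `path`."""
    # Imported here so the coordinate helper above works without pyspark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # Must be set before the first session is created in this process.
        .config("spark.jars.packages", avro_package(spark_version, scala_version))
        .getOrCreate()
    )
    # "avro" is the short format name of the built-in data source (Spark >= 2.4.0).
    return spark.read.format("avro").load(path)
```

For example, read_avro("kv.avro", "2.4.0", "2.11") would request org.apache.spark:spark-avro_2.11:2.4.0, assuming a Spark 2.4.0 build for Scala 2.11.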

Spark < 2.4.0

You can use the spark-avro library. First, let's create an example dataset:

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumWriter  # needed by DataFileWriter below

schema_string ='''{"namespace": "example.avro",
 "type": "record",
 "name": "KeyValue",
 "fields": [
     {"name": "key", "type": "string"},
     {"name": "value",  "type": ["int", "null"]}
 ]
}'''

# The legacy avro-python3 package spells this avro.schema.Parse.
schema = avro.schema.parse(schema_string)

# Avro container files are binary, so open the file in "wb" mode.
with open("kv.avro", "wb") as f, DataFileWriter(f, DatumWriter(), schema) as wrt:
    wrt.append({"key": "foo", "value": -1})
    wrt.append({"key": "bar", "value": 1})

Reading it using spark-avro is as simple as this:

df = sqlContext.read.format("com.databricks.spark.avro").load("kv.avro")
df.show()

## +---+-----+
## |key|value|
## +---+-----+
## |foo|   -1|
## |bar|    1|
## +---+-----+

answered Sep 18 '22 19:09

zero323

The former solution requires installing a third-party Java dependency, which is not something most Python devs are happy with. But you don't really need an external library if all you want to do is parse your Avro files with a given schema. You can just read the binary files and parse them with your favorite Python Avro package.

For instance, this is how you can load Avro files using fastavro:

from io import BytesIO
import fastavro

schema = {
    ...
}

rdd = sc.binaryFiles("/path/to/dataset/*.avro")\
    .flatMap(lambda args: fastavro.reader(BytesIO(args[1]), reader_schema=schema))

print(rdd.collect())

answered Sep 20 '22 19:09

Régis B.
