I am writing a spark job using python. However, I need to read in a whole bunch of avro files.
This is the closest solution that I have found in Spark's example folder. However, you need to submit this python script using spark-submit. In the command line of spark-submit, you can specify the driver-class, in that case, all your avrokey, avrovalue class will be located.
avro_rdd = sc.newAPIHadoopFile(
path,
"org.apache.avro.mapreduce.AvroKeyInputFormat",
"org.apache.avro.mapred.AvroKey",
"org.apache.hadoop.io.NullWritable",
keyConverter="org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter",
conf=conf)
In my case, I need to run everything within the Python script, I have tried to create an environment variable to include the jar file, finger cross Python will add the jar to the path but clearly it is not, it is giving me unexpected class error.
os.environ['SPARK_SUBMIT_CLASSPATH'] = "/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0.jar"
Can anyone help me how to read avro file in one python script?
Even if you install the correct Avro package for your Python environment, the API differs between avro and avro-python3 . As an example, for Python 2 (with avro package), you need to use the function avro. schema. parse but for Python 3 (with avro-python3 package), you need to use the function avro.
Spark >= 2.4.0
You can use built-in Avro support. The API is backwards compatible with the spark-avro
package, with a few additions (most notably from_avro
/ to_avro
function).
Please note that module is not bundled with standard Spark binaries and has to be included using spark.jars.packages
or equivalent mechanism.
See also Pyspark 2.4.0, read avro from kafka with read stream - Python
Spark < 2.4.0
You can use spark-avro
library. First lets create an example dataset:
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
schema_string ='''{"namespace": "example.avro",
"type": "record",
"name": "KeyValue",
"fields": [
{"name": "key", "type": "string"},
{"name": "value", "type": ["int", "null"]}
]
}'''
schema = avro.schema.parse(schema_string)
with open("kv.avro", "w") as f, DataFileWriter(f, DatumWriter(), schema) as wrt:
wrt.append({"key": "foo", "value": -1})
wrt.append({"key": "bar", "value": 1})
Reading it using spark-csv
is as simple as this:
df = sqlContext.read.format("com.databricks.spark.avro").load("kv.avro")
df.show()
## +---+-----+
## |key|value|
## +---+-----+
## |foo| -1|
## |bar| 1|
## +---+-----+
The former solution requires to install a third-party Java dependency, which is not something most Python devs are happy with. But you don't really need an external library if all you want to do is parse your Avro files with a given schema. You can just read the binary files and parse them with your favorite python Avro package.
For instance, this is how you can load Avro files using fastavro
:
from io import BytesIO
import fastavro
schema = {
...
}
rdd = sc.binaryFiles("/path/to/dataset/*.avro")\
.flatMap(lambda args: fastavro.reader(BytesIO(args[1]), reader_schema=schema))
print(rdd.collect())
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With