Pyspark 2.4.0, read avro from kafka with read stream - Python

I am trying to read Avro messages from Kafka using PySpark 2.4.0.

The spark-avro external module provides this for reading Avro files:

df = spark.read.format("avro").load("examples/src/main/resources/users.avro") 
df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")

However, I need to read streamed Avro messages. The library documentation suggests using the from_avro() function, which is only available for Scala and Java.

Are there any other modules that support reading Avro messages streamed from Kafka?

Panagiotis Fytas asked Feb 14 '19


1 Answer

You can include the spark-avro package, for example using --packages (adjust versions to match your Spark installation):

bin/pyspark --packages org.apache.spark:spark-avro_2.11:2.4.0

and provide your own wrappers:

from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column


def from_avro(col, jsonFormatSchema):
    """Deserialize an Avro binary column using a JSON-format Avro schema."""
    sc = SparkContext._active_spark_context
    avro = sc._jvm.org.apache.spark.sql.avro
    f = getattr(getattr(avro, "package$"), "MODULE$").from_avro
    return Column(f(_to_java_column(col), jsonFormatSchema))


def to_avro(col):
    """Serialize a column to Avro binary format."""
    sc = SparkContext._active_spark_context
    avro = sc._jvm.org.apache.spark.sql.avro
    f = getattr(getattr(avro, "package$"), "MODULE$").to_avro
    return Column(f(_to_java_column(col)))

Example usage (adapted from the official test suite):

from pyspark.sql.functions import col, struct


avro_type_struct = """
{
  "type": "record",
  "name": "struct",
  "fields": [
    {"name": "col1", "type": "long"},
    {"name": "col2", "type": "string"}
  ]
}"""


df = spark.range(10).select(struct(
    col("id"),
    col("id").cast("string").alias("id2")
).alias("struct"))
avro_struct_df = df.select(to_avro(col("struct")).alias("avro"))
avro_struct_df.show(3)
+----------+
|      avro|
+----------+
|[00 02 30]|
|[02 02 31]|
|[04 02 32]|
+----------+
only showing top 3 rows
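The byte arrays above are plain Avro binary encoding: a record is just its fields concatenated in schema order, longs are zigzag-encoded varints, and strings are a length (encoded as a long) followed by UTF-8 bytes. A small pure-Python sketch reproduces the rows shown (the helper names `zigzag`, `varint`, `encode_long`, and `encode_record` are mine for illustration, not part of any library):

```python
def zigzag(n):
    # Avro maps signed longs to unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4
    return (n << 1) ^ (n >> 63)

def varint(n):
    # Little-endian base-128 varint: 7 data bits per byte, high bit = "more follows"
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode_long(n):
    return varint(zigzag(n))

def encode_record(col1, col2):
    # Record = fields concatenated in schema order:
    # long col1, then string col2 (length-prefixed UTF-8)
    data = col2.encode("utf-8")
    return encode_long(col1) + encode_long(len(data)) + data

# encode_record(0, "0").hex() -> "000230", the [00 02 30] row above
```

For the first row (`id=0`), col1 is the long 0 (zigzag 0 → byte 00), and col2 is the one-character string "0" (length 1 → zigzag 2 → byte 02, then 30 for the UTF-8 byte of "0").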
avro_struct_df.select(from_avro("avro", avro_type_struct)).show(3)
+------------------------------------------------+
|from_avro(avro, struct<col1:bigint,col2:string>)|
+------------------------------------------------+
|                                          [0, 0]|
|                                          [1, 1]|
|                                          [2, 2]|
+------------------------------------------------+
only showing top 3 rows
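With these wrappers in place, the same from_avro can be applied to a Kafka stream, which is what the question asks for. A minimal sketch, assuming the spark-sql-kafka package is also on the classpath; the broker address and topic name are placeholders, and avro_type_struct is the JSON-format schema defined above:

```python
# Sketch: decode Avro-encoded Kafka values with readStream.
# "localhost:9092" and "my-topic" are placeholder values.
stream_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "my-topic")
    .load())

# The Kafka source exposes the payload as a binary "value" column.
decoded = stream_df.select(
    from_avro(col("value"), avro_type_struct).alias("data"))

query = (decoded.writeStream
    .format("console")
    .start())
```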
10465355 answered Oct 17 '22