Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark structured streaming kafka convert JSON without schema (infer schema)

I read Spark Structured Streaming doesn't support schema inference for reading Kafka messages as JSON. Is there a way to retrieve schema the same as Spark Streaming does:

val dataFrame = spark.read.json(rdd.map(_.value()))
dataFrame.printschema 
like image 806
Arnon Rodman Avatar asked Jan 20 '18 21:01

Arnon Rodman


1 Answers

It is possible using this construct:

myStream = spark.readStream.schema(spark.read.json("my_sample_json_file_as_schema.json").schema).json("my_json_file")..

How can this be? Well, as the spark.read.json("..").schema returns exactly a wanted inferred schema, you can use this returned schema as an argument for the mandatory schema parameter of spark.readStream

What I did was to specify a one-liner sample-json as input for inferring the schema stuff so it does not unnecessary take up memory. In case your data changes, simply update your sample-json.

Took me a while to figure out (constructing StructTypes and StructFields by hand was pain in the ..), therefore I'll be happy for all upvotes :-)

like image 163
Aydin K. Avatar answered Oct 08 '22 17:10

Aydin K.