I am trying to read XML data from a Kafka topic using Spark Structured Streaming.
I tried using the Databricks spark-xml package, but I got an error saying that this package does not support streamed reading. Is there any way I can extract XML data from a Kafka topic using Structured Streaming?
My current code:
df = spark \
    .readStream \
    .format("kafka") \
    .format('com.databricks.spark.xml') \
    .options(rowTag="MainElement") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option(subscribeType, "test") \
    .load()
The error:
py4j.protocol.Py4JJavaError: An error occurred while calling o33.load.
: java.lang.UnsupportedOperationException: Data source com.databricks.spark.xml does not support streamed reading
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:234)
Kafka itself supports any data format, so XML is no problem at all. It accepts any serializable payload, and since XML is just text, plain string serializers can be used.
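To see that in practice, here is a minimal producer sketch; the topic name and the MainElement root are taken from the question, the payload itself is hypothetical, and the rest is the standard Kafka producer API with plain string serializers:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
// XML is just text, so the stock string serializers are enough
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// hypothetical payload matching the rowTag from the question
producer.send(new ProducerRecord[String, String]("test", "<MainElement><number>42</number></MainElement>"))
producer.close()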
.format("kafka") \ .format('com.databricks.spark.xml') \
The last one, with com.databricks.spark.xml, wins and becomes the streaming source (hiding Kafka as the source). In other words, the above is equivalent to .format('com.databricks.spark.xml') alone.
As you may have experienced, the Databricks spark-xml package does not support streamed reading (i.e. it cannot act as a streaming source). The package is not for streaming.
Is there any way I can extract XML data from a Kafka topic using Structured Streaming?
You are left with accessing and processing the XML yourself with a standard function or a UDF. There's no built-in support for streaming XML processing in Structured Streaming up to Spark 2.2.0.
That should not be a big deal anyway. In Scala, the code could look as follows.
import org.apache.spark.sql.functions.udf
import spark.implicits._

val input = spark.
  readStream.
  format("kafka").
  option("kafka.bootstrap.servers", "localhost:9092").
  option("subscribe", "test").
  load

val values = input.select('value cast "string")

// parse the XML payload yourself, e.g. in a UDF (see the sketch below)
val extractValuesFromXML = udf { (xml: String) => ??? }
val numbersFromXML = values.withColumn("number", extractValuesFromXML('value))

// print XMLs and numbers to the stdout
val q = numbersFromXML.
  writeStream.
  format("console").
  start
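For illustration, here is a minimal sketch of the UDF body, assuming payloads rooted at MainElement (the rowTag from the question) with a hypothetical number child element, parsed with the standard scala.xml library:

import scala.xml.XML

val extractValuesFromXML = udf { (xml: String) =>
  // <number> is a hypothetical child element used for this example
  (XML.loadString(xml) \ "number").text.toInt
}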
Another possible solution could be to write your own custom streaming Source that would deal with the XML format in def getBatch(start: Option[Offset], end: Offset): DataFrame. That is supposed to work.
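For completeness, a rough skeleton of that approach, assuming the internal Source API of Spark 2.x (org.apache.spark.sql.execution.streaming.Source); all the actual Kafka reading and XML parsing is left as ???:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.{Offset, Source}
import org.apache.spark.sql.types.StructType

class XmlKafkaSource extends Source {
  // schema of the rows produced after parsing the XML payloads
  override def schema: StructType = ???

  // the latest offset available in the underlying Kafka topic
  override def getOffset: Option[Offset] = ???

  // read records between the offsets and parse their XML payloads here
  override def getBatch(start: Option[Offset], end: Offset): DataFrame = ???

  override def stop(): Unit = ()
}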