
How to create a custom streaming data source?

I have a custom reader for Spark Streaming that reads data from a WebSocket. I'm going to try Spark Structured Streaming.

How to create a streaming data source in Spark Structured Streaming?

asked Dec 02 '17 by szu

People also ask

How does Pyspark read stream data?

Use readStream.format("socket") on the SparkSession object to read data from a socket, and provide the host and port options for where you want to stream data from.
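For example (assuming spark is your SparkSession; start a socket first, e.g. with nc -lk 9999):

    val lines = spark
      .readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load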

What is a streaming sink?

Sink is an extension of the BaseStreamingSink contract for streaming sinks that can add batches to an output. Sink is part of the Data Source API V1 and is used in Micro-Batch Stream Processing only.

What is readStream in spark?

Spark Structured Streaming uses readStream to monitor a folder and process files that arrive in the directory in real time, and uses writeStream to write out a DataFrame or Dataset. It is a scalable, high-throughput, fault-tolerant stream processing system that supports both batch and streaming workloads.
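For example, a minimal sketch (the schema, column name, and paths are placeholders):

    import org.apache.spark.sql.types.{StringType, StructType}

    // File sources require a user-specified schema
    val schema = new StructType().add("name", StringType)

    // Watch a directory for new JSON files...
    val files = spark
      .readStream
      .schema(schema)
      .json("/tmp/in")

    // ...and append them to a Parquet sink
    val query = files
      .writeStream
      .format("parquet")
      .option("checkpointLocation", "/tmp/checkpoint")
      .start("/tmp/out")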


1 Answer

A streaming data source implements org.apache.spark.sql.execution.streaming.Source.

The scaladoc of org.apache.spark.sql.execution.streaming.Source should give you enough information to get started (just follow the types to develop a compilable Scala type).
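For reference, this is roughly what the contract looks like in the Spark 2.x sources (imports elided, comments mine; commit has a no-op default, so you only strictly need the other four methods):

    package org.apache.spark.sql.execution.streaming

    trait Source {
      // Schema of the rows this source produces
      def schema: StructType

      // Highest offset available so far, or None if no data has arrived yet
      def getOffset: Option[Offset]

      // The data between start (exclusive) and end (inclusive) as a DataFrame
      def getBatch(start: Option[Offset], end: Offset): DataFrame

      // All data up to end has been processed and may be discarded
      def commit(end: Offset): Unit = {}

      // Release resources, e.g. close the WebSocket connection
      def stop(): Unit
    }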

Once you have the Source, you have to register it so it can be used in the format of a DataStreamReader. The trick to making a streaming source available under a short name in format is to create a DataSourceRegister for it. You can find examples of registered sources in META-INF/services/org.apache.spark.sql.sources.DataSourceRegister in the Spark sources:

org.apache.spark.sql.execution.datasources.csv.CSVFileFormat
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider
org.apache.spark.sql.execution.datasources.json.JsonFileFormat
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
org.apache.spark.sql.execution.datasources.text.TextFileFormat
org.apache.spark.sql.execution.streaming.ConsoleSinkProvider
org.apache.spark.sql.execution.streaming.TextSocketSourceProvider
org.apache.spark.sql.execution.streaming.RateSourceProvider

That's the file that links the short name in format to the implementation.
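A skeleton of such a provider might look as follows. This is a sketch with made-up names (MyDataSourceRegister, MyWebSocketSource, the "yourCustomSource" alias, and the single-column schema are all placeholders); StreamSourceProvider is the Data Source API V1 hook that actually creates the Source:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.execution.streaming.Source
    import org.apache.spark.sql.sources.{DataSourceRegister, StreamSourceProvider}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    class MyDataSourceRegister extends DataSourceRegister with StreamSourceProvider {

      // The alias end users pass to format(...)
      override def shortName(): String = "yourCustomSource"

      private val mySchema = StructType(StructField("value", StringType) :: Nil)

      // Resolves the schema before the source is instantiated
      override def sourceSchema(
          sqlContext: SQLContext,
          schema: Option[StructType],
          providerName: String,
          parameters: Map[String, String]): (String, StructType) =
        (shortName(), mySchema)

      // Creates the actual Source that backs the streaming query
      override def createSource(
          sqlContext: SQLContext,
          metadataPath: String,
          schema: Option[StructType],
          providerName: String,
          parameters: Map[String, String]): Source =
        new MyWebSocketSource(sqlContext) // your Source implementation (hypothetical class)
    }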

What I usually recommend people do during my Spark workshops is to start development from both sides:

  1. Write the streaming query (with format), e.g.

    val input = spark
      .readStream
      .format("yourCustomSource") // <-- your custom source here
      .load
    
  2. Implement the streaming Source and a corresponding DataSourceRegister (it could be the same class)

  3. (optional) Register the DataSourceRegister by writing the fully-qualified class name, say com.mycompany.spark.MyDataSourceRegister, to META-INF/services/org.apache.spark.sql.sources.DataSourceRegister:

    $ cat META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
    com.mycompany.spark.MyDataSourceRegister
    

The last step, where you register the DataSourceRegister implementation for your custom Source, is optional; it only registers the data source alias that your end users use in the DataFrameReader.format method.
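If you skip the registration, end users can still load your source by passing the fully-qualified class name of the provider to format:

    val input = spark
      .readStream
      .format("com.mycompany.spark.MyDataSourceRegister") // no alias needed
      .load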

    format(source: String): DataFrameReader

Specifies the input data source format.

Review the code of org.apache.spark.sql.execution.streaming.RateSourceProvider for a good head start.
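For a feel of how a registered source behaves from the user side, you can also try the built-in rate source that RateSourceProvider registers:

    val rates = spark
      .readStream
      .format("rate") // short name registered by RateSourceProvider
      .option("rowsPerSecond", 10)
      .load
    rates.printSchema
    // root
    //  |-- timestamp: timestamp (nullable = true)
    //  |-- value: long (nullable = true)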

answered Sep 19 '22 by Jacek Laskowski