
How to use foreach or foreachBatch in PySpark to write to database?

I want to do Spark Structured Streaming (Spark 2.4.x) from a Kafka source to a MariaDB with Python (PySpark).

I want to use the streaming Spark DataFrame directly, not a static or Pandas DataFrame.

It seems that one has to use foreach or foreachBatch, since there is no built-in database sink for streaming DataFrames according to https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks.

Here is my try:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StructField, StructType, StringType, DoubleType, TimestampType
from pyspark.sql import DataFrameWriter
# configuration of target db
db_target_url = "jdbc:mysql://localhost/database"
db_target_properties = {"user":"writer", "password":"1234"}
# schema
schema_simple = StructType([StructField("Signal", StringType()),StructField("Value", DoubleType())])

# create spark session
spark = SparkSession.builder.appName("streamer").getOrCreate()

# create DataFrame representing the stream
df = spark.readStream \
  .format("kafka").option("kafka.bootstrap.servers", "localhost:9092") \
  .option("subscribe", "mytopic") \
  .load() \
  .selectExpr("Timestamp", "cast (value as string) as json") \
  .select("Timestamp", F.from_json("json", schema_simple).alias('json_wrapper')) \
  .selectExpr("Timestamp", "json_wrapper.Signal", "json_wrapper.Value")
df.printSchema()
# Do some dummy processing
df2 = df.filter("Value < 11111111111")
print("df2: ", df2.isStreaming)

def process_row(row):
    # Process row
    row.write.jdbc(url=db_target_url, table="mytopic", mode="append", properties=db_target_properties)
    pass
query = df2.writeStream.foreach(process_row).start()

I get an error:

AttributeError: write

Why?

asked Nov 08 '19 by tardis

People also ask

How do you use a foreach in Pyspark?

Example of PySpark foreach: first create a DataFrame in Python, then define a simple function that prints an element and pass it to foreach. The print function is applied to every row, so all the data in the DataFrame gets printed while foreach iterates over the elements for you.
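
A minimal sketch of that pattern (the column names and values here are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("foreach_demo").getOrCreate()
# hypothetical two-column DataFrame
df_demo = spark.createDataFrame([("sensor_a", 1.0), ("sensor_b", 2.0)], ["Signal", "Value"])

def print_row(row):
    # runs on the executors, so the output shows up in the worker logs
    print(row)

df_demo.foreach(print_row)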

What is readStream in Spark?

Spark Structured Streaming uses readStream to monitor a folder and process files that arrive in the directory in real time, and uses writeStream to write out a DataFrame or Dataset. Spark Streaming is a scalable, high-throughput, fault-tolerant stream processing system that supports both batch and streaming workloads.
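
A minimal sketch of that folder-monitoring pattern (the directory path and schema are assumptions):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("readstream_demo").getOrCreate()
schema = StructType([StructField("Signal", StringType()), StructField("Value", DoubleType())])

# watch a directory for new CSV files and echo every micro-batch to the console
stream_df = spark.readStream.schema(schema).csv("/tmp/incoming")  # hypothetical folder
query = stream_df.writeStream.format("console").start()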

What is the difference between Spark streaming and structured streaming?

Spark Streaming receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the Spark SQL API for data stream processing. In the end, these APIs are optimized using the Spark Catalyst optimizer and translated into RDDs for execution under the hood.

Is foreach action in Spark?

The foreach() operation is an action. It does not return any value; it executes the input function on each element of an RDD.
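
A quick sketch of foreach as an RDD action, reusing the spark session created above:

rdd = spark.sparkContext.parallelize([1, 2, 3, 4])

def log_element(x):
    # executed on the workers; nothing comes back to the driver
    print(x)

rdd.foreach(log_element)  # returns None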

What is foreach in pyspark?

The PySpark foreach function applies a given function to each and every element (row) of a DataFrame; nothing is returned to the driver. foreachPartition is the related operation that applies a function to each and every partition of an RDD instead of to individual elements.

What is the difference between foreach and foreachbatch in spark?

If you really need support from Spark (and do use write.jdbc) you should actually use foreachBatch. while foreach allows custom write logic on every row, foreachBatch allows arbitrary operations and custom logic on the output of each micro-batch.
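
Using the names from the question above, a side-by-side sketch of the two hooks might look like this (the per-row handler body is just a placeholder):

# foreach: called once per row of the streaming DataFrame
def handle_row(row):
    pass  # custom per-row write logic goes here

query = df2.writeStream.foreach(handle_row).start()

# foreachBatch: called once per micro-batch with a regular (non-streaming) DataFrame
def handle_batch(batch_df, epoch_id):
    batch_df.write.jdbc(url=db_target_url, table="mytopic", mode="append",
                        properties=db_target_properties)

query = df2.writeStream.foreachBatch(handle_batch).start()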

What is foreach partition in spark?

foreachPartition is used to apply a function to each and every partition of an RDD. We can define a function and pass it to foreachPartition in PySpark so that it runs once per partition. Like foreach, it is an action operation in Spark used for data processing; see the sketch below.
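
For example, a sketch that opens one database connection per partition on a static DataFrame (mysql-connector-python and the table layout are assumptions; a streaming DataFrame would go through foreach/foreachBatch instead):

import mysql.connector  # assumption: installed on the executors

def write_partition(rows):
    # one connection per partition instead of one per row
    conn = mysql.connector.connect(host="localhost", user="writer",
                                   password="1234", database="database")
    cursor = conn.cursor()
    for row in rows:
        cursor.execute("INSERT INTO mytopic (`Signal`, `Value`) VALUES (%s, %s)",
                       (row.Signal, row.Value))
    conn.commit()
    conn.close()

static_df = spark.createDataFrame([("sensor_a", 1.0)], ["Signal", "Value"])
static_df.foreachPartition(write_partition)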

What is the use of foreach (~) method in Dataframe?

Given its limitations, the foreach(~) method is mainly used for logging some information about each row or writing it to an external system such as a database. It takes a function that is applied to each Row of the DataFrame and returns nothing. Note that anything printed inside that function (e.g. row.name) is printed on the worker nodes, so you would not see the output in the driver program.




2 Answers

tl;dr Replace foreach with foreachBatch.


Quoting the official documentation:

The foreach and foreachBatch operations allow you to apply arbitrary operations and writing logic on the output of a streaming query. They have slightly different use cases - while foreach allows custom write logic on every row, foreachBatch allows arbitrary operations and custom logic on the output of each micro-batch.

In other words, your writeStream.foreach(process_row) acts on a single row (of data) that has no write.jdbc available and hence the error.

Think of the row as a piece of data that you can save anywhere you want using any API you want.
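
As a hedged illustration of that, PySpark's foreach also accepts an object with open/process/close methods, so each row can be written with whatever client library you prefer (mysql-connector-python and the table layout are assumptions here, not something the answer prescribes):

import mysql.connector  # assumption: installed on the executors

class RowWriter:
    def open(self, partition_id, epoch_id):
        self.conn = mysql.connector.connect(host="localhost", user="writer",
                                            password="1234", database="database")
        self.cursor = self.conn.cursor()
        return True  # process the rows of this partition

    def process(self, row):
        self.cursor.execute("INSERT INTO mytopic (`Signal`, `Value`) VALUES (%s, %s)",
                            (row.Signal, row.Value))

    def close(self, error):
        self.conn.commit()
        self.conn.close()

query = df2.writeStream.foreach(RowWriter()).start()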

If you really need support from Spark (and do use write.jdbc) you should actually use foreachBatch.

while foreach allows custom write logic on every row, foreachBatch allows arbitrary operations and custom logic on the output of each micro-batch.

answered Nov 15 '22 by Jacek Laskowski


With the support of Jacek, I could fix my example:

def process_row(df, epoch_id):
    # df is the regular (non-streaming) DataFrame of this micro-batch, so write.jdbc is available here
    df.write.jdbc(url=db_target_url, table="mytopic", mode="append", properties=db_target_properties)

query = df2.writeStream.foreachBatch(process_row).start()

You must also include epoch_id in the function parameters; otherwise you get errors in the Spark log file that are not shown in the Jupyter notebook.

answered Nov 15 '22 by tardis