I want to limit the rate when fetching data from Kafka. My code looks like:
df = spark.read.format("kafka") \
    .option("kafka.bootstrap.servers", "...") \
    .option("subscribe", "A") \
    .option("startingOffsets", '''{"A":{"0":200,"1":200,"2":200}}''') \
    .option("endingOffsets", '''{"A":{"0":400,"1":400,"2":400}}''') \
    .option("maxOffsetsPerTrigger", 20) \
    .load() \
    .cache()
However, when I call df.count(), the result is 600. What I expected is 20. Does anyone know why "maxOffsetsPerTrigger" doesn't work?
You are fetching 200 records from each of the three partitions (0, 1, 2), since each partition's range runs from offset 200 to offset 400; 3 × 200 = 600 records in total.
As the Kafka integration documentation puts it:
Use maxOffsetsPerTrigger option to limit the number of records to fetch per trigger.
This means that maxOffsetsPerTrigger is a rate limit for streaming queries: each trigger (micro-batch) reads at most 20 records, but across all triggers the query still processes every record in the configured offset range (200 per partition). Your query is a batch read (spark.read), which has no triggers at all, so the option is ignored and the whole range between startingOffsets and endingOffsets is fetched in one go, hence the count of 600.
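Here is a minimal sketch of the usage where maxOffsetsPerTrigger does take effect, a streaming query (this is not the question's code): the console sink and checkpoint path are placeholder choices, and endingOffsets is dropped because it is only supported for batch queries.

# Streaming read: maxOffsetsPerTrigger caps each micro-batch at 20 offsets
# across all partitions; endingOffsets is not supported in streaming.
df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "...") \
    .option("subscribe", "A") \
    .option("startingOffsets", '''{"A":{"0":200,"1":200,"2":200}}''') \
    .option("maxOffsetsPerTrigger", 20) \
    .load()

# Placeholder sink: print each micro-batch to the console.
query = df.writeStream \
    .format("console") \
    .option("checkpointLocation", "/tmp/checkpoint") \
    .start()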
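If the goal is a batch read of exactly 20 records, narrow the offset range instead and drop maxOffsetsPerTrigger. A sketch under the assumption that offsets are contiguous (no gaps from compaction or aborted transactions); ending offsets are exclusive, so 7 + 7 + 6 = 20:

# Batch read capped by the offset range itself rather than by a trigger limit.
df = spark.read.format("kafka") \
    .option("kafka.bootstrap.servers", "...") \
    .option("subscribe", "A") \
    .option("startingOffsets", '''{"A":{"0":200,"1":200,"2":200}}''') \
    .option("endingOffsets", '''{"A":{"0":207,"1":207,"2":206}}''') \
    .load()

df.count()  # 20, provided no offsets are missing in the range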