
How to stop spark structured streaming from listing all files in an S3 bucket every time

I have a Structured Streaming job in PySpark that does some aggregations on a file source. A Kinesis Firehose combines the data from an IoT-type application and stores it in an S3 location as one file per minute, in different folders, with the following folder structure -

s3://year/month/day/hour/

My Spark Structured Streaming job seems to stall from listing all the files available in my S3 bucket, as the listing process takes more time than the processingTime I've set. I get the following warnings, and I was wondering if there is a way to prevent this.

18/06/15 14:28:35 WARN ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 60000 milliseconds, but spent 74364 milliseconds
18/06/15 14:28:42 WARN FileStreamSource: Listed 4449 file(s) in 6822.134244 ms
18/06/15 14:29:06 WARN FileStreamSource: Listed 4449 file(s) in 6478.381219 ms
18/06/15 14:30:08 WARN FileStreamSource: Listed 4450 file(s) in 8285.654031 ms
asked Jun 15 '18 by ArunK


People also ask

How do you stop structured streaming queries?

If by "gracefully" you mean that the streaming query should complete processing of data, then void stop() will not do that. It will just wait until the threads performing execution has stopped (as mentioned in the documentation).

Which property must a Spark structured streaming sink possess to ensure end to end exactly once semantics?

Exactly-once semantics are only possible if the source is replayable and the sink is idempotent.
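As a hedged illustration of those two properties: a file source is replayable and a file sink with a checkpoint is idempotent. The schema fields and local paths below are assumptions for the sketch, not from the original post:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("exactly-once-sketch").getOrCreate()

schema = StructType().add("device", StringType()).add("reading", DoubleType())

# Replayable source: the file source can re-read files after a failure.
events = spark.readStream.schema(schema).json("/tmp/input")

# Idempotent sink: the file sink plus a checkpoint lets Spark skip
# batches it has already committed, giving end-to-end exactly-once.
query = (events.writeStream
         .format("parquet")
         .option("path", "/tmp/output")
         .option("checkpointLocation", "/tmp/checkpoint")
         .start())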

Can Spark streaming read from S3?

You need to provide the path to the S3 bucket, and it will stream all the data from all the files in that bucket. Then, whenever a new file is created in the bucket, it will be streamed. If you are appending data to an existing file that has already been read, those new updates will not be read.

What is the difference between Spark streaming and structured streaming?

Spark Streaming receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the Spark SQL API for data stream processing. In the end, all the APIs are optimized using the Spark Catalyst optimizer and translated into RDDs for execution under the hood.

What is structured streaming in spark?

This tutorial module introduces Structured Streaming, the main model for handling streaming datasets in Apache Spark. In Structured Streaming, a data stream is treated as a table that is being continuously appended. This leads to a stream processing model that is very similar to a batch processing model.

How to stream data with PySpark?

First we import the required PySpark libraries and start a SparkSession. Remember that Structured Streaming processing always requires the specification of a schema for the data in the stream. We then load our data into a streaming DataFrame by using readStream, as in the sketch below.
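A minimal sketch of those steps; the schema fields and the S3 path are illustrative assumptions, not taken from the original post:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("iot-stream").getOrCreate()

# Structured Streaming file sources require an explicit schema.
schema = (StructType()
          .add("device_id", StringType())
          .add("temperature", DoubleType())
          .add("event_time", TimestampType()))

# readStream turns the directory into a streaming DataFrame; new files
# that land under the prefix are picked up in later micro-batches.
stream_df = (spark.readStream
             .schema(schema)
             .json("s3://my-bucket/year/month/day/hour/"))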

How to use structured streaming in Databricks?

Structured Streaming tutorial

1. Load sample data. The easiest way to get started with Structured Streaming is to use an example Azure Databricks dataset available in the /databricks-datasets folder accessible within the Azure Databricks ...
2. Initialize the stream.
3. Start the streaming job.
4. Interactively query the stream.

How to create streaming DataFrames/datasets in spark?

Similar to static Datasets/DataFrames, you can use the common entry point SparkSession (Scala/Java/Python/R docs) to create streaming DataFrames/Datasets from streaming sources, and apply the same operations on them as on static DataFrames/Datasets.
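For example, continuing from the stream_df in the sketch above, the same DataFrame operations used on static data apply directly to the stream; the aggregation below is just an illustration:

from pyspark.sql import functions as F

# Group and aggregate exactly as you would on a static DataFrame.
per_device = (stream_df
              .groupBy("device_id")
              .agg(F.avg("temperature").alias("avg_temperature")))

# Streaming aggregations need the "complete" or "update" output mode.
query = (per_device.writeStream
         .outputMode("complete")
         .format("console")
         .start())

query.awaitTermination()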


2 Answers

The S3 API List operation can only retrieve all object keys in a bucket that share a prefix, so it's simply impossible to list only new, unprocessed objects. The Databricks folks seem to have a solution where you set up S3 to create an SQS record when a new object is created. Spark then checks SQS for new objects and retrieves the specific objects from S3 (i.e. no listing involved). Unfortunately this connector seems to be available only on Databricks clusters and hasn't been open sourced, so if you're using, for example, EMR, you can't use it (unless of course you implement the connector yourself).

answered Nov 15 '22 by lfk


A comment in the class FileStreamSource:

// Output a warning when listing files uses more than 2 seconds.

So, to get rid of this warning, you could reduce the number of files processed every trigger:

The maxFilesPerTrigger option can be set on the file source to ensure listing takes less than 2 seconds.

The first warning means the trigger interval you have set (60000 ms) is shorter than the time the batch actually took (74364 ms). Just increase the trigger interval to get rid of it; both options are shown in the sketch below.
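A hedged sketch of both suggestions applied together, assuming a SparkSession and a schema like the ones shown earlier; the file limit of 100, the bucket name, and the output paths are assumptions:

# Cap how many files each micro-batch reads.
stream_df = (spark.readStream
             .schema(schema)
             .option("maxFilesPerTrigger", 100)
             .json("s3://my-bucket/year/month/day/hour/"))

# Use a trigger interval longer than the ~74 s batches seen in the log.
query = (stream_df.writeStream
         .format("parquet")
         .option("path", "s3://my-bucket/output/")
         .option("checkpointLocation", "s3://my-bucket/checkpoints/")
         .trigger(processingTime="120 seconds")
         .start())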

answered Nov 15 '22 by bp2010