
How to continuously monitor a directory by using Spark Structured Streaming

I want Spark to continuously monitor a directory and read CSV files with spark.readStream as soon as a file appears in that directory.

Please don't post a Spark Streaming solution; I am looking for a way to do it with Spark Structured Streaming.

Asked Sep 13 '17 by Naman Agarwal


People also ask

Does Spark Streaming run continuously?

Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.

What is the difference between Spark Streaming and Structured Streaming?

Spark Streaming receives real-time data and divides it into smaller batches for the execution engine. Structured Streaming, in contrast, is built on the Spark SQL API for data stream processing. In the end, both APIs are optimized by the Spark Catalyst optimizer and translated into RDDs for execution under the hood.

How does Spark Structured Streaming work?

Structured Streaming is a high-level API for stream processing that became production-ready in Spark 2.2. Structured Streaming allows you to take the same operations that you perform in batch mode using Spark's structured APIs, and run them in a streaming fashion.
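
For example, here is a rough sketch of that idea (the schema and path are placeholders, and spark is assumed to be an existing SparkSession, e.g. from spark-shell): the same structured-API transformation works on a batch DataFrame and on a streaming one.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

val userSchema = new StructType().add("name", "string").add("age", "integer")

// Batch read of the directory
val batchDF  = spark.read.schema(userSchema).csv("/path/to/directory")
// Streaming read of the same directory
val streamDF = spark.readStream.schema(userSchema).csv("/path/to/directory")

// Identical transformation logic in both modes
val adultsBatch  = batchDF.filter(col("age") >= 18)
val adultsStream = streamDF.filter(col("age") >= 18)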

Can the Spark Structured Streaming API be used to process graph data?

Spark SQL implements the higher-level Dataset and DataFrame APIs of Spark and adds SQL support on top of it. The libraries built on top of these include MLlib for machine learning, GraphFrames for graph analysis, and two APIs for stream processing: Spark Streaming and Structured Streaming.


2 Answers

Here is the complete solution for this use case.

If you are running in standalone mode, you can increase the driver memory as:

bin/spark-shell --driver-memory 4G

There is no need to set the executor memory, since in standalone mode the executor runs within the driver.
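
The same flag also works when submitting a packaged application (the class and jar names below are only placeholders):

bin/spark-submit --driver-memory 4G --class com.example.StreamingApp target/streaming-app.jar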

Completing @T.Gaweda's solution, here it is:

import org.apache.spark.sql.types.StructType

val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark
  .readStream
  .option("sep", ";")
  .schema(userSchema)           // Specify the schema of the CSV files
  .csv("/path/to/directory")    // Equivalent to format("csv").load("/path/to/directory")

csvDF.writeStream.format("console").option("truncate", "false").start()

Now Spark will continuously monitor the specified directory, and as soon as you add any CSV file to the directory, your DataFrame operations on csvDF will be executed on that file.
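
For reference, a minimal sketch of how the same snippet could sit inside a standalone application (the object name is hypothetical; query.awaitTermination() keeps the application alive so new files keep being processed):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object CsvDirectoryMonitor {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("CsvDirectoryMonitor").getOrCreate()

    val userSchema = new StructType().add("name", "string").add("age", "integer")

    val csvDF = spark.readStream
      .option("sep", ";")
      .schema(userSchema)
      .csv("/path/to/directory")

    // Print every new file's rows to the console as it arrives
    val query = csvDF.writeStream
      .format("console")
      .option("truncate", "false")
      .start()

    query.awaitTermination()
  }
}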

Note: If you want Spark to infer the schema, you first have to set the following configuration:

spark.sqlContext.setConf("spark.sql.streaming.schemaInference", "true")

where spark is your SparkSession.
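
A small sketch of the same setting via spark.conf.set (the header option is only an assumption about the CSV files):

// Enable streaming schema inference so Spark samples existing files
// in the directory instead of requiring an explicit schema
spark.conf.set("spark.sql.streaming.schemaInference", "true")

val inferredDF = spark.readStream
  .option("sep", ";")
  .option("header", "true")   // assumes the CSV files have a header row
  .csv("/path/to/directory")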

Answered by Naman Agarwal


As written in the official documentation, you should use the "file" source:

File source - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations.

Code example taken from documentation:

// Read all the csv files written atomically in a directory
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark
  .readStream
  .option("sep", ";")
  .schema(userSchema)      // Specify schema of the csv files
  .csv("/path/to/directory")    // Equivalent to format("csv").load("/path/to/directory")
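
The "atomically placed" requirement from the quote above matters in practice: write each file somewhere else first and then move it into the monitored directory. A minimal sketch (paths are placeholders):

import java.nio.file.{Files, Paths, StandardCopyOption}

// Write the CSV to a staging location first, then move it into the
// monitored directory so Spark never sees a half-written file
Files.move(
  Paths.get("/tmp/staging/data.csv"),
  Paths.get("/path/to/directory/data.csv"),
  StandardCopyOption.ATOMIC_MOVE)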

If you don't specify a trigger, Spark will read new files as soon as possible.
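
If you do want to control how often the directory is checked, you can set a trigger on the query, for example (interval chosen arbitrarily):

import org.apache.spark.sql.streaming.Trigger

// Poll the directory for new files every 10 seconds instead of "as soon as possible"
csvDF.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()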

Answered by T. Gawęda