I have a client that places CSV files in nested directories, as shown below, and I need to read these files in near real time. I am trying to do this using Spark Structured Streaming.
Data:
/user/data/1.csv
/user/data/2.csv
/user/data/3.csv
/user/data/sub1/1_1.csv
/user/data/sub1/1_2.csv
/user/data/sub1/sub2/2_1.csv
/user/data/sub1/sub2/2_2.csv
Code:
val csvDF = spark
  .readStream
  .option("sep", ",")
  .schema(userSchema) // schema of the CSV files
  .csv("/user/data/")
Is there any configuration that allows Spark to read from nested directories in Structured Streaming?
I am able to stream the files in sub-directories using a glob path. Posting here for the sake of others.
inputPath = "/spark_structured_input/*?*"
inputDF = spark.readStream.option("header", "true").schema(userSchema).csv(inputPath)
query = inputDF.writeStream.format("console").start()
query.awaitTermination()  # keep the stream running until it is stopped
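On Spark 3.0 and later, the file sources also accept a recursiveFileLookup option, which avoids crafting a glob at all. A minimal sketch in Scala, assuming your Spark version honors this option for streaming file sources:
val csvDF = spark
  .readStream
  .option("sep", ",")
  .option("recursiveFileLookup", "true") // scan /user/data/ and every sub-directory
  .schema(userSchema)
  .csv("/user/data/")
Note that recursiveFileLookup disables partition inference, so use it only when the directory names carry no partition columns.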
As far as I know, Spark has no such option, but it does support glob patterns in the path.
val csvDF = spark
  .readStream
  .option("sep", ",")
  .schema(userSchema) // schema of the CSV files
  .csv("/user/data/*/*")
This may help you design a glob path for your layout and use it in a single stream.
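Keep in mind that /user/data/*/* only matches files exactly one directory below /user/data/. A sketch of a single glob covering all three depths in the question's layout, assuming Hadoop-style brace alternation in the path:
val csvDF = spark
  .readStream
  .option("sep", ",")
  .schema(userSchema)
  // matches CSVs at the top level, one level down, and two levels down
  .csv("/user/data/{*.csv,*/*.csv,*/*/*.csv}")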
Hope it helps!