I am trying to load only the files for specific dates in Apache Spark by filtering inside the path passed to the file-to-RDD function sc.textFile().
I tried the following:
sc.textFile("/user/Orders/201507(2[7-9]{1}|3[0-1]{1})*")
This should match the following:
/user/Orders/201507270010033.gz
/user/Orders/201507300060052.gz
Any idea how to achieve this?
Spark Core provides the textFile() and wholeTextFiles() methods on the SparkContext class, which read one or more text or CSV files into a single Spark RDD. The same methods can also read all files in a directory, or only the files that match a specific pattern.
PySpark SQL rlike() function: rlike() evaluates a regex against a Column value and returns a Column of Boolean type. It is a method on the Column type, so it filters values inside a column rather than file paths.
textFile is a method of the org.apache.spark.SparkContext class that reads a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and returns it as an RDD of Strings.
You can pass a list of file paths to the Spark read API, for example spark.read.json(input_file_paths). This loads all the files into a single DataFrame, and any transformations performed afterwards run in parallel across multiple executors, depending on your Spark configuration.
Looking at the accepted answer, it seems to use some form of glob syntax. It also reveals that the API is an exposure of Hadoop's FileInputFormat.
Searching reveals that paths supplied to FileInputFormat's addInputPath or setInputPaths "may represent a file, a directory, or, by using glob, a collection of files and directories". Perhaps SparkContext also uses those APIs to set the path.
The syntax of the glob includes:
* (match 0 or more characters)
? (match a single character)
[ab] (character class)
[^ab] (negated character class)
[a-b] (character range)
{a,b} (alternation)
\c (escape character)
Following the example in the accepted answer, it is possible to write your path as:
sc.textFile("/user/Orders/2015072[7-9]*,/user/Orders/2015073[0-1]*")
It was not clear to me how the alternation syntax could be used here, since a comma is also used to delimit a list of paths (as shown above). According to zero323's comment, however, no escaping is necessary:
sc.textFile("/user/Orders/201507{2[7-9],3[0-1]}*")
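To sanity-check a pattern like this without a cluster, here is a minimal local sketch in plain Python. It is only an approximation: it uses Python's fnmatch module rather than Hadoop's own glob implementation, and expand_braces and glob_matches are hypothetical helpers written for this illustration, not Spark or Hadoop functions. The third file name is made up to show a non-match.

```python
import fnmatch
import re


def expand_braces(pattern):
    """Expand {a,b} alternation groups into a list of brace-free patterns.

    Hypothetical helper: fnmatch does not understand {a,b}, so each
    alternative is turned into its own pattern first.
    """
    match = re.search(r"\{([^{}]*)\}", pattern)  # leftmost innermost group
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for option in match.group(1).split(","):
        expanded.extend(expand_braces(head + option + tail))
    return expanded


def glob_matches(path, pattern):
    """Return True if path matches any brace-expanded form of pattern.

    Caveats of this approximation: fnmatch lets * match "/" as well,
    and it spells negated classes [!ab] rather than [^ab].
    """
    return any(fnmatch.fnmatchcase(path, p) for p in expand_braces(pattern))


pattern = "/user/Orders/201507{2[7-9],3[0-1]}*"

# The brace form expands to the same patterns as the comma-separated form.
expanded = expand_braces(pattern)
comma_separated = ",".join(expanded)
print(comma_separated)

candidates = [
    "/user/Orders/201507270010033.gz",  # from the question
    "/user/Orders/201507300060052.gz",  # from the question
    "/user/Orders/201507260010033.gz",  # made-up name, should not match
]
for path in candidates:
    print(path, glob_matches(path, pattern))
```

The expansion step also shows why the two forms above are equivalent: the alternation is just shorthand for listing each alternative as its own path pattern.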