 

How to use regex to include/exclude some input files in sc.textFile?

I am trying to filter which input files Apache Spark reads into an RDD, by date, using the path argument of sc.textFile().

I have attempted to do the following:

sc.textFile("/user/Orders/201507(2[7-9]{1}|3[0-1]{1})*") 

This should match the following:

/user/Orders/201507270010033.gz
/user/Orders/201507300060052.gz

Any idea how to achieve this?

asked Aug 03 '15 by eboni

People also ask

How do I read multiple text files in RDD?

Spark core provides the textFile() and wholeTextFiles() methods on the SparkContext class, which read single or multiple text or CSV files into a single Spark RDD. These methods can also read all files in a directory, or only the files matching a specific pattern.

What is Rlike in PySpark?

rlike() evaluates a regex against a Column value and returns a Column of type Boolean. rlike() is a function on the Column type; for more examples refer to PySpark Column Type and its functions.

What does SC textFile return?

textFile is a method of the org.apache.spark.SparkContext class that reads a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and returns it as an RDD of Strings.

How do I read files in parallel spark?

You can pass a list of file paths to a Spark read API, e.g. spark.read.json(input_file_paths) (source). This loads all the files into a single DataFrame, and the transformations eventually performed on it will be executed in parallel by multiple executors, depending on your Spark config.


1 Answer

Looking at the accepted answer, sc.textFile appears to use some form of glob syntax rather than a full regex. It also reveals that the API is a thin exposure of Hadoop's FileInputFormat.

Searching reveals that paths supplied to FileInputFormat's addInputPath or setInputPath "may represent a file, a directory, or, by using glob, a collection of files and directories". SparkContext presumably uses those same APIs to set the path.

The syntax of the glob includes:

  • * (matches 0 or more characters)
  • ? (matches a single character)
  • [ab] (character class)
  • [^ab] (negated character class)
  • [a-b] (character range)
  • {a,b} (alternation)
  • \c (escape character)
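To see which paths a pattern would select, the tokens above can be sketched as a small glob-to-regex translator in plain Python. This is a standalone illustration of the syntax, not Hadoop's actual matcher; the function names are made up for the example, and `*` is treated as not crossing a `/` boundary:

```python
import re

def hadoop_glob_to_regex(pattern):
    """Translate a Hadoop-style glob into a Python regex string.

    Illustrative sketch only; covers the tokens listed above.
    """
    out, i = [], 0
    while i < len(pattern):
        c = pattern[i]
        if c == '*':
            out.append('[^/]*')                 # * : 0 or more characters
        elif c == '?':
            out.append('[^/]')                  # ? : a single character
        elif c == '\\':
            i += 1
            out.append(re.escape(pattern[i]))   # \c : escaped character
        elif c == '[':                          # [ab], [^ab], [a-b]
            j = pattern.index(']', i)
            out.append(pattern[i:j + 1])        # same syntax in Python regex
            i = j
        elif c == '{':                          # {a,b} alternation -> (a|b)
            j = pattern.index('}', i)
            alts = pattern[i + 1:j].split(',')
            out.append('(' + '|'.join(hadoop_glob_to_regex(a) for a in alts) + ')')
            i = j
        else:
            out.append(re.escape(c))            # literal character
        i += 1
    return ''.join(out)

def glob_match(pattern, path):
    """True if the whole path matches the glob pattern."""
    return re.fullmatch(hadoop_glob_to_regex(pattern), path) is not None
```

With this, `glob_match("/user/Orders/201507{2[7-9],3[0-1]}*", "/user/Orders/201507270010033.gz")` is true, while a file for the 26th is rejected.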

Following the example in the accepted answer, it is possible to write your path as:

sc.textFile("/user/Orders/2015072[7-9]*,/user/Orders/2015073[0-1]*") 
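As a sanity check, the comma-separated form can be simulated with Python's standard fnmatch module, which supports the *, ? and [a-b] tokens. This only illustrates which filenames the two globs select; it is not how Spark resolves paths, and the helper name is invented for the example:

```python
from fnmatch import fnmatch

def matches_any(paths_arg, filename):
    # Mimic sc.textFile splitting its path argument on commas,
    # then test the filename against each glob in the list.
    return any(fnmatch(filename, p) for p in paths_arg.split(','))

PATTERNS = "/user/Orders/2015072[7-9]*,/user/Orders/2015073[0-1]*"
```

Here `matches_any(PATTERNS, "/user/Orders/201507270010033.gz")` is true, while a path dated the 26th matches neither glob.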

It is not immediately obvious that the {a,b} alternation syntax can be used here, since the comma also delimits a list of paths (as shown above). According to zero323's comment, however, it works and no escaping is necessary:

sc.textFile("/user/Orders/201507{2[7-9],3[0-1]}*") 
answered Sep 29 '22 by nhahtdh