How to use regex to include/exclude some input files in sc.textFile?

Tags:

scala

apache-spark

I am trying to select only the files for specific dates in Apache Spark, directly in the path passed to sc.textFile(), the function that reads files into an RDD.

I have attempted to do the following:

sc.textFile("/user/Orders/201507(2[7-9]{1}|3[0-1]{1})*")

This should match the following:

/user/Orders/201507270010033.gz
/user/Orders/201507300060052.gz

Any idea how to achieve this?

asked Aug 03 '15 08:08

eboni

1 Answer

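Judging by the accepted answer, sc.textFile() seems to accept some form of glob syntax rather than regular expressions, which would explain why the regex-style grouping in the question does not match. That answer also reveals that the API is an exposure of Hadoop's FileInputFormat.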

Searching reveals that paths supplied to FileInputFormat's addInputPath or setInputPaths "may represent a file, a directory, or, by using glob, a collection of files and directories". Presumably SparkContext uses those same APIs to set the path.

The syntax of the glob includes:

* (matches 0 or more characters)
? (matches a single character)
[ab] (character class)
[^ab] (negated character class)
[a-b] (character range)
{a,b} (alternation)
\c (escape character)

Following the example in the accepted answer, it is possible to write your path as:

sc.textFile("/user/Orders/2015072[7-9]*,/user/Orders/2015073[0-1]*")
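
If the set of days varies, the comma-separated list can be built programmatically instead of written by hand. The following is only a sketch: OrderPaths and dayGlobs are hypothetical names, and it assumes file names start with a yyyyMMdd date as in the question.

```scala
object OrderPaths {
  // Hypothetical helper: builds one glob per day and joins the globs with
  // commas, the same list format used in the sc.textFile call above.
  def dayGlobs(baseDir: String, yearMonth: String, days: Range): String =
    days.map(day => f"$baseDir/$yearMonth$day%02d*").mkString(",")

  def main(args: Array[String]): Unit = {
    val paths = dayGlobs("/user/Orders", "201507", 27 to 31)
    println(paths)
    // The resulting string could then be passed on, e.g. sc.textFile(paths)
  }
}
```

This yields one glob per day (27 through 31) rather than the two character-range globs above, but it should select the same files.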

At first it was not clear how the alternation syntax could be used here, since a comma is also used to delimit a list of paths (as shown above). According to zero323's comment, no escaping is necessary:

sc.textFile("/user/Orders/201507{2[7-9],3[0-1]}*")
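
To sanity-check the alternation pattern without a cluster, the file-name part can be tried against sample names locally. This sketch uses java.nio's "glob:" matcher purely as a stand-in: its syntax overlaps with Hadoop's for *, [a-b] and {a,b}, but it is not the implementation Spark goes through, so treat the result as indicative only. GlobCheck is a hypothetical name, and the last two file names are made up for contrast.

```scala
import java.nio.file.{FileSystems, Paths}

object GlobCheck {
  // Assumption: java.nio glob matching approximates Hadoop's glob for this
  // pattern; only the file-name part of the path is tested here.
  private val matcher =
    FileSystems.getDefault().getPathMatcher("glob:201507{2[7-9],3[0-1]}*")

  def matches(fileName: String): Boolean =
    matcher.matches(Paths.get(fileName))

  def main(args: Array[String]): Unit = {
    val names = Seq(
      "201507270010033.gz", // from the question
      "201507300060052.gz", // from the question
      "201507260010033.gz", // made-up name, day 26
      "201508010010033.gz"  // made-up name, different month
    )
    names.foreach(name => println(s"$name -> ${matches(name)}"))
  }
}
```

The two names from the question should be reported as matching and the two made-up ones as not matching.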

answered Sep 29 '22 07:09

nhahtdh
