Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark 2.2/Jupyter Notebook SQL regexp_extract function not matching regex pattern

I'm using the regexp_extract Spark 2.2 SQL function in a Jupyter (Scala) notebook to match a string of 11 or more repeating characters.

Here's the regex:

^(.)\1{10,}$

Now, let's look at that pattern with the regexp_extract function. Here's how I've used it in my notebook:

spark.sql("SELECT REGEXP_EXTRACT('hhhhhhhhhhhhh', '^(.)\\1{10,}$', 1) as ExtractedChar").show()

+-------------+
|ExtractedChar|
+-------------+
|             |
+-------------+

Odd, no output. Let's make sure my regex pattern is actually correct. Yep, looks right.

You may be wondering why the regex pattern contains two "\\" characters, it's because it is an escape character so two are necessary. Here's some verification:

1. val string = "SELECT REGEXP_EXTRACT('hhhhhhhhhhhhhhhhhhhhh', '^(.)\\1{10,}$', 1) as ExtractedChar"
2. println(string)
SELECT REGEXP_EXTRACT('hhhhhhhhhhhhhhhhhhhhh', '^(.)\1{10,}$', 1) as ExtractedChar

Alright, let's make sure the regexp_extract function is working correctly:

spark.sqlContext.sql("SELECT REGEXP_EXTRACT('TESTING', '^.', 0) as test").show()
+----+
|test|
+----+
|   T|
+----+

Okay, maybe the issue is the Jupyter notebook? After checking and using the Scala REPL, I'm still having the same issue.

Any ideas why I'm unable to get this regex to successfully match?

Edit: Spark SQL is a requirement for this. I could create my own UDF using Scala; however, UDFs are black boxed by Spark meaning they will not be fully optimized.

like image 446
lemon master Avatar asked Oct 13 '25 11:10

lemon master


1 Answers

I found the solution. The SQL string needs to include 4 "\" characters, like so:

'^(.)\\\\1{10,}$'
like image 83
lemon master Avatar answered Oct 15 '25 04:10

lemon master