java.lang.IndexOutOfBoundsException: No group 1 | Pyspark

Tags:

regex

pyspark

I'm trying to extract the district of some postcodes using regex with the following script in Pyspark:

postcodes.select("raw_postcode", regexp_extract('raw_postcode', '^[a-zA-Z]+\d\d?[a-zA-Z]?', 1).alias("area")).show(40, False)

I get the following exception:

Py4JJavaError: An error occurred while calling o562.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 4 times, most recent failure: Lost task 0.3 in stage 17.0 (TID 44, ip-172-31-100-215.eu-west-1.compute.internal, executor 1): java.lang.IndexOutOfBoundsException: No group 1
    at java.util.regex.Matcher.group(Matcher.java:538)

I have tried the regex in plain Python and it works, but it fails in PySpark. What is the reason?

ebertbm asked Oct 20 '25 01:10


1 Answer

The third argument to regexp_extract (idx) is the index of the capturing group whose contents you want to extract. Your regex defines no capturing groups, so group 1 does not exist and Java throws "No group 1". Pass 0 as that argument to get the whole match.

You can also use [0-9] instead of \d to avoid backslash-escaping issues in the pattern string.

So, you can use:

postcodes.select("raw_postcode", 
   regexp_extract('raw_postcode', '^[a-zA-Z]+[0-9]{1,2}[a-zA-Z]?', 0).alias("area")
).show(40, False)

Details

  • ^ - start of string
  • [a-zA-Z]+ - 1+ ASCII letters
  • [0-9]{1,2} - 1 or 2 digits
  • [a-zA-Z]? - an optional ASCII letter.
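
The group-index behaviour can be reproduced outside Spark. Below is a minimal sketch using Python's built-in re module rather than PySpark; the extract helper and the sample postcode are hypothetical and only there for illustration. The helper returns an empty string when nothing matches, which is what regexp_extract does as well. Requesting group 1 from a pattern without parentheses fails here with IndexError, the Python counterpart of the Java IndexOutOfBoundsException in the question.

```python
import re

# Pattern from the answer: no parentheses, hence no capturing groups.
PATTERN = re.compile(r'^[a-zA-Z]+[0-9]{1,2}[a-zA-Z]?')
# Variant with the whole district wrapped in one capturing group.
PATTERN_GROUPED = re.compile(r'^([a-zA-Z]+[0-9]{1,2}[a-zA-Z]?)')


def extract(pattern, text, idx):
    """Hypothetical helper: return group `idx` of the first match, or '' if no match."""
    m = pattern.search(text)
    return m.group(idx) if m else ''


sample = "SW1A 1AA"  # made-up sample value

# Group 0 is always the whole match, so it works without parentheses.
print(extract(PATTERN, sample, 0))

# Group 1 works once the pattern actually defines a capturing group.
print(extract(PATTERN_GROUPED, sample, 1))

# Asking for group 1 when none is defined raises an error.
try:
    extract(PATTERN, sample, 1)
except IndexError as exc:
    print("group 1 requested but not defined:", exc)
```

So there are two equivalent fixes: keep the pattern as is and pass 0, or wrap the part you want in parentheses and keep passing 1.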
Wiktor Stribiżew answered Oct 22 '25 04:10


