java.lang.IndexOutOfBoundsException: No group 1 | Pyspark

Question

I'm trying to extract the district from some postcodes using a regex with the following PySpark script:

postcodes.select("raw_postcode", regexp_extract('raw_postcode', '^[a-zA-Z]+\d\d?[a-zA-Z]?', 1).alias("area")).show(40, False)

I get the following exception:

Py4JJavaError: An error occurred while calling o562.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 4 times, most recent failure: Lost task 0.3 in stage 17.0 (TID 44, ip-172-31-100-215.eu-west-1.compute.internal, executor 1): java.lang.IndexOutOfBoundsException: No group 1
    at java.util.regex.Matcher.group(Matcher.java:538)

I have tried the regex in plain Python and it works, but it fails in PySpark. What is the reason?

Wiktor Stribiżew · Accepted Answer

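The last argument to regexp_extract (the one after the pattern) is the index of the capturing group whose contents you want to extract. Your regex defines no capturing groups, so group 1 does not exist and Java's Matcher.group throws IndexOutOfBoundsException: No group 1. Pass 0 instead: group 0 always refers to the whole match.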
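The group-index behaviour can be reproduced without Spark. The sketch below uses Python's own re module rather than the Java engine that Spark calls, so it is only an analogy; the sample postcode "SW1A 1AA" and the variable names are assumptions made for illustration.

```python
import re

# The pattern contains no parentheses, so it defines no capturing groups.
pattern = re.compile(r'^[a-zA-Z]+[0-9]{1,2}[a-zA-Z]?')
match = pattern.match("SW1A 1AA")  # hypothetical sample postcode

# Group 0 is always available: it is the entire matched text.
whole_match = match.group(0)

# Asking for group 1 fails because no such group was defined.
try:
    match.group(1)
    group_1_exists = True
except IndexError:
    group_1_exists = False

# Wrapping the pattern in parentheses defines group 1, so index 1 becomes valid.
captured = re.match(r'^([a-zA-Z]+[0-9]{1,2}[a-zA-Z]?)', "SW1A 1AA").group(1)
```

In other words, either keep the pattern as it is and ask for group 0, or add parentheses around the part you want and ask for group 1.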

Besides, you may use [0-9] instead of \d so that the pattern contains no backslash that might need escaping.

So, you may use

from pyspark.sql.functions import regexp_extract

postcodes.select("raw_postcode", 
   regexp_extract('raw_postcode', '^[a-zA-Z]+[0-9]{1,2}[a-zA-Z]?', 0).alias("area")
).show(40, False)

Details

^ - start of string
[a-zA-Z]+ - 1+ ASCII letters
[0-9]{1,2} - 1 or 2 digits
[a-zA-Z]? - an optional ASCII letter.
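To see what the pattern described above picks out, here is a small sketch in plain Python. The helper extract_district and the sample postcodes are hypothetical. Returning an empty string when nothing matches is meant to mirror what regexp_extract does for a non-matching row.

```python
import re

# Same pattern as in the answer: letters, then 1 or 2 digits, then an optional letter.
DISTRICT_RE = re.compile(r'^[a-zA-Z]+[0-9]{1,2}[a-zA-Z]?')

def extract_district(raw_postcode):
    """Return the leading district part of a postcode, or "" if there is no match."""
    m = DISTRICT_RE.match(raw_postcode)
    return m.group(0) if m else ""

# Hypothetical sample values, not taken from the asker's data.
samples = ["SW1A 1AA", "M1 1AE", "B33 8TH", "12345"]
districts = [extract_district(p) for p in samples]
```

The last sample starts with a digit, so [a-zA-Z]+ cannot match at the start of the string and the helper falls back to the empty string.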

Tags: regex, pyspark

ebertbm
