Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

r Regular expression for extracting UK postcode from an address is not ordered

I'm trying to extract UK postcodes from address strings in R, using the regular expression provided by the UK government here.

Here is my function:

address_to_postcode <- function(addresses) {

  # 1. Convert addresses to upper case
  addresses = toupper(addresses)

  # 2. Regular expression for UK postcodes:
  pcd_regex = "[Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) {0,1}[0-9][A-Za-z]{2})"

  # 3. Check if a postcode is present in each address or not (return TRUE if present, else FALSE)
  present <- grepl(pcd_regex, addresses)

  # 4. Extract postcodes matching the regular expression for a valid UK postcode
  postcodes <- regmatches(addresses, regexpr(pcd_regex, addresses))

  # 5. Return NA where an address does not contain a (valid format) UK postcode
  postcodes_out <- list()
  postcodes_out[present] <- postcodes
  postcodes_out[!present] <- NA

  # 6. Return the results in a vector (should be same length as input vector)
  return(do.call(c, postcodes_out))
}

According to the guidance document, the logic this regular expression looks for is as follows:

"GIR 0AA" OR One letter followed by either one or two numbers OR One letter followed by a second letter that must be one of ABCDEFGHJ KLMNOPQRSTUVWXY (i.e..not I) and then followed by either one or two numbers OR One letter followed by one number and then another letter OR A two part post code where the first part must be One letter followed by a second letter that must be one of ABCDEFGH JKLMNOPQRSTUVWXY (i.e..not I) and then followed by one number and optionally a further letter after that AND The second part (separated by a space from the first part) must be One number followed by two letters. A combination of upper and lower case characters is allowed. Note: the length is determined by the regular expression and is between 2 and 8 characters.

My problem is that this logic is not completely preserved when using the regular expression without the ^ and $ anchors (as I have to do in this scenario because the postcode could be anywhere within the address strings); what I'm struggling with is how to preserve the order and number of characters for each segment in a partial (as opposed to complete) string match.

Consider the following example:

> address_to_postcode("1A noplace road, random city, NR1 2PK, UK")
[1] "NR1 2PK"

According to the logic in the guideline, the second letter in the postcode cannot be 'z' (and there are some other exclusions too); however look what happens when I add a 'z':

> address_to_postcode("1A noplace road, random city, NZ1 2PK, UK")
[1] "Z1 2PK"

... whereas in this case I would expect the output to be NA.

Adding the anchors (for a different usage case) doesn't seem to help as the 'z' is still accepted even though it is in the wrong place:

> grepl("^[Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) {0,1}[0-9][A-Za-z]{2})$", "NZ1 2PK")
[1] TRUE

Two questions:

  1. Have I misunderstood the logic of the regular expression and
  2. If not, how can I correct it (i.e. why aren't the specified letter and character ranges exclusive to their position within the regular expression)?
like image 865
Amy M Avatar asked Dec 08 '22 14:12

Amy M


1 Answers

Edit

Since posting this answer, I dug deeper into the UK government's regex and found even more problems. I posted another answer here that describes all the issues and provides alternatives to their poorly formatted regex.


Note

Please note that I'm posting the raw regex here. You'll need to escape certain characters (like backslashes \) when porting to r.


Issues

You have many issues here, all of which are caused by whoever created the document you're retrieving your regex from or the coder that created it.

1. The space character

My guess is that when you copied the regular expression from the link you provided it converted the space character into a newline character and you removed it (that's exactly what I did at first). You need to, instead, change it to a space character.

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
                                                                                                                                                here ^

2. Boundaries

You need to remove the anchors ^ and $ as these indicate start and end of line. Instead, wrap your regex in (?:) and place a \b (word boundary) on either end as the following shows. In fact, the regex in the documentation is incorrect (see Side note for more information) as it will fail to anchor the pattern properly.

See regex in use here

\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
^^^^^                                                                                                                                                                      ^^^

3. Character class oversight

There's a missing - in the character class as pointed out by @deadcrab in his answer here.

\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
                                                                                           ^

4. They made the wrong character class optional!

In the documentation it clearly states:

A two part post code where the first part must be:

  • One letter followed by a second letter that must be one of ABCDEFGHJKLMNOPQRSTUVWXY (i.e..not I) and then followed by one number and optionally a further letter after that

They made the wrong character class optional!

\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
                                                                                                                                        ^^^^^^
                                                                                                                        it should be this one ^^^^^^^^

5. The whole thing is just awful...

There are so many things wrong with this regex that I just decided to rewrite it. It can very easily be simplified to perform a fraction of the steps it currently takes to match text.

\b(?:[A-Za-z][A-HJ-Ya-hj-y]?[0-9][0-9A-Za-z]? [0-9][A-Za-z]{2}|[Gg][Ii][Rr] 0[Aa]{2})\b

Answer

As mentioned in the comments below my answer, some postcodes are missing the space character. For missing spaces in the postcodes (e.g. NR12PK), simply add a ? after the spaces as shown in the regex below:

\b(?:[A-Za-z][A-HJ-Ya-hj-y]?[0-9][0-9A-Za-z]? ?[0-9][A-Za-z]{2}|[Gg][Ii][Rr] ?0[Aa]{2})\b
                                             ^^                             ^^

You may also shorten the regex above with the following and use the case-insensitive flag (ignore.case(pattern) or ignore_case = TRUE in r, depending on the method used.):

\b(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]? ?[0-9][A-Z]{2}|GIR ?0A{2})\b

Note

Please note that regular expressions only validate the possible format(s) of a string and cannot actually identify whether or not a postcode legitimately exists. For this, you should use an API. There are also some edge-cases where this regex will not properly match valid postcodes. For a list of these postcodes, please see this Wikipedia article.

The regex below additionally matches the following (make it case-insensitive to match lowercase variants as well):

  • British Overseas Territories
  • British Forces Post Office
    • Although they've recently changed it to align with the British postcode system to BF, followed by a number (starting with BF1), they're considered optional alternative postcodes
  • Special cases outlined in that article (as well as SAN TA1 - a valid postcode for Santa!)

See this regex in use here.

\b(?:(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]?|ASCN|STHL|TDCU|BBND|[BFS]IQ{2}|GX11|PCRN|TKCA) ?[0-9][A-Z]{2}|GIR ?0A{2}|SAN ?TA1|AI-?[0-9]{4}|BFPO[ -]?[0-9]{2,3}|MSR[ -]?1(?:1[12]|[23][135])0|VG[ -]?11[1-6]0|[A-Z]{2} ? [0-9]{2}|KY[1-3][ -]?[0-2][0-9]{3})\b

I would also recommend anyone implementing this answer to read this StackOverflow question titled UK Postcode Regex (Comprehensive).


Side note

The documentation you linked to (Bulk Data Transfer: Additional Validation for CAS Upload - Section 3. UK Postcode Regular Expression) actually has an improperly written regular expression.

As mentioned in the Issues section, they should have:

  1. Wrapped the entire expression in (?:) and placed the anchors around the non-capturing group. Their regular expression, as it stands, will fail in for some cases as seen here.
  2. The regular expression is also missing - in one of the character classes
  3. It also made the wrong character class optional.
like image 100
ctwheels Avatar answered Jan 21 '23 09:01

ctwheels