Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

RegEx for matching UK Postcodes

People also ask

Which type of validation check could be used for a field containing a postcode?

ONSPD can be reliably used as a postcode validation service so long as data currency is not a prohibitive factor.

How many UK postcodes are there?

In the UK there are 124 postcode areas e.g AB. This consists of 121 UK postcodes and 3 postcodes which apply to the Crown Dependencies of Jersey, Guernsey and the Isle of Man. Each postcode area is divided into postcode districts e.g. AB12 and each postcode district is divided into postcode sectors e.g. AB12 1.

How do I verify an Eircode?

Currently, the only means to validate an Eircode is either manually via finder.eircode.ie (limited to 15 uses per day!) or by purchasing access to either the ECAF or ECAD databases... which is overkill for a simple Yes/No validation and they are much too expensive.


I'd recommend taking a look at the UK Government Data Standard for postcodes [link now dead; archive of XML, see Wikipedia for discussion]. There is a brief description about the data and the attached xml schema provides a regular expression. It may not be exactly what you want but would be a good starting point. The RegEx differs from the XML slightly, as a P character in third position in format A9A 9AA is allowed by the definition given.

The RegEx supplied by the UK Government was:

([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\s?[0-9][A-Za-z]{2})

As pointed out on the Wikipedia discussion, this will allow some non-real postcodes (e.g. those starting AA, ZY) and they do provide a more rigorous test that you could try.


I recently posted an answer to this question on UK postcodes for the R language. I discovered that the UK Government's regex pattern is incorrect and fails to properly validate some postcodes. Unfortunately, many of the answers here are based on this incorrect pattern.

I'll outline some of these issues below and provide a revised regular expression that actually works.


Note

My answer (and regular expressions in general):

  • Only validates postcode formats.
  • Does not ensure that a postcode legitimately exists.
    • For this, use an appropriate API! See Ben's answer for more info.

If you don't care about the bad regex and just want to skip to the answer, scroll down to the Answer section.

The Bad Regex

The regular expressions in this section should not be used.

This is the failing regex that the UK government has provided developers (not sure how long this link will be up, but you can see it in their Bulk Data Transfer documentation):

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))[0-9][A-Za-z]{2})$

Problems

Problem 1 - Copy/Paste

See regex in use here.

As many developers likely do, they copy/paste code (especially regular expressions) and paste them expecting them to work. While this is great in theory, it fails in this particular case because copy/pasting from this document actually changes one of the characters (a space) into a newline character as shown below:

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))
[0-9][A-Za-z]{2})$

The first thing most developers will do is just erase the newline without thinking twice. Now the regex won't match postcodes with spaces in them (other than the GIR 0AA postcode).

To fix this issue, the newline character should be replaced with the space character:

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
                                                                                                                                                     ^

Problem 2 - Boundaries

See regex in use here.

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
^^                     ^ ^                                                                                                                                            ^^

The postcode regex improperly anchors the regex. Anyone using this regex to validate postcodes might be surprised if a value like fooA11 1AA gets through. That's because they've anchored the start of the first option and the end of the second option (independently of one another), as pointed out in the regex above.

What this means is that ^ (asserts position at start of the line) only works on the first option ([Gg][Ii][Rr] 0[Aa]{2}), so the second option will validate any strings that end in a postcode (regardless of what comes before).

Similarly, the first option isn't anchored to the end of the line $, so GIR 0AAfoo is also accepted.

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))[0-9][A-Za-z]{2})$

To fix this issue, both options should be wrapped in another group (or non-capturing group) and the anchors placed around that:

^(([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))$
^^                                                                                                                                                                      ^^

Problem 3 - Improper Character Set

See regex in use here.

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
                                                                                       ^^

The regex is missing a - here to indicate a range of characters. As it stands, if a postcode is in the format ANA NAA (where A represents a letter and N represents a number), and it begins with anything other than A or Z, it will fail.

That means it will match A1A 1AA and Z1A 1AA, but not B1A 1AA.

To fix this issue, the character - should be placed between the A and Z in the respective character set:

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
                                                                                        ^

Problem 4 - Wrong Optional Character Set

See regex in use here.

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
                                                                                                                                        ^

I swear they didn't even test this thing before publicizing it on the web. They made the wrong character set optional. They made [0-9] option in the fourth sub-option of option 2 (group 9). This allows the regex to match incorrectly formatted postcodes like AAA 1AA.

To fix this issue, make the next character class optional instead (and subsequently make the set [0-9] match exactly once):

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?)))) [0-9][A-Za-z]{2})$
                                                                                                                                                ^

Problem 5 - Performance

Performance on this regex is extremely poor. First off, they placed the least likely pattern option to match GIR 0AA at the beginning. How many users will likely have this postcode versus any other postcode; probably never? This means every time the regex is used, it must exhaust this option first before proceeding to the next option. To see how performance is impacted check the number of steps the original regex took (35) against the same regex after having flipped the options (22).

The second issue with performance is due to the way the entire regex is structured. There's no point backtracking over each option if one fails. The way the current regex is structured can greatly be simplified. I provide a fix for this in the Answer section.

Problem 6 - Spaces

See regex in use here

This may not be considered a problem, per se, but it does raise concern for most developers. The spaces in the regex are not optional, which means the users inputting their postcodes must place a space in the postcode. This is an easy fix by simply adding ? after the spaces to render them optional. See the Answer section for a fix.


Answer

1. Fixing the UK Government's Regex

Fixing all the issues outlined in the Problems section and simplifying the pattern yields the following, shorter, more concise pattern. We can also remove most of the groups since we're validating the postcode as a whole (not individual parts):

See regex in use here

^([A-Za-z][A-Ha-hJ-Yj-y]?[0-9][A-Za-z0-9]? ?[0-9][A-Za-z]{2}|[Gg][Ii][Rr] ?0[Aa]{2})$

This can further be shortened by removing all of the ranges from one of the cases (upper or lower case) and using a case-insensitive flag. Note: Some languages don't have one, so use the longer one above. Each language implements the case-insensitivity flag differently.

See regex in use here.

^([A-Z][A-HJ-Y]?[0-9][A-Z0-9]? ?[0-9][A-Z]{2}|GIR ?0A{2})$

Shorter again replacing [0-9] with \d (if your regex engine supports it):

See regex in use here.

^([A-Z][A-HJ-Y]?\d[A-Z\d]? ?\d[A-Z]{2}|GIR ?0A{2})$

2. Simplified Patterns

Without ensuring specific alphabetic characters, the following can be used (keep in mind the simplifications from 1. Fixing the UK Government's Regex have also been applied here):

See regex in use here.

^([A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}|GIR ?0A{2})$

And even further if you don't care about the special case GIR 0AA:

^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$

3. Complicated Patterns

I would not suggest over-verification of a postcode as new Areas, Districts and Sub-districts may appear at any point in time. What I will suggest potentially doing, is added support for edge-cases. Some special cases exist and are outlined in this Wikipedia article.

Here are complex regexes that include the subsections of 3. (3.1, 3.2, 3.3).

In relation to the patterns in 1. Fixing the UK Government's Regex:

See regex in use here

^(([A-Z][A-HJ-Y]?\d[A-Z\d]?|ASCN|STHL|TDCU|BBND|[BFS]IQQ|PCRN|TKCA) ?\d[A-Z]{2}|BFPO ?\d{1,4}|(KY\d|MSR|VG|AI)[ -]?\d{4}|[A-Z]{2} ?\d{2}|GE ?CX|GIR ?0A{2}|SAN ?TA1)$

And in relation to 2. Simplified Patterns:

See regex in use here

^(([A-Z]{1,2}\d[A-Z\d]?|ASCN|STHL|TDCU|BBND|[BFS]IQQ|PCRN|TKCA) ?\d[A-Z]{2}|BFPO ?\d{1,4}|(KY\d|MSR|VG|AI)[ -]?\d{4}|[A-Z]{2} ?\d{2}|GE ?CX|GIR ?0A{2}|SAN ?TA1)$

3.1 British Overseas Territories

The Wikipedia article currently states (some formats slightly simplified):

  • AI-1111: Anguila
  • ASCN 1ZZ: Ascension Island
  • STHL 1ZZ: Saint Helena
  • TDCU 1ZZ: Tristan da Cunha
  • BBND 1ZZ: British Indian Ocean Territory
  • BIQQ 1ZZ: British Antarctic Territory
  • FIQQ 1ZZ: Falkland Islands
  • GX11 1ZZ: Gibraltar
  • PCRN 1ZZ: Pitcairn Islands
  • SIQQ 1ZZ: South Georgia and the South Sandwich Islands
  • TKCA 1ZZ: Turks and Caicos Islands
  • BFPO 11: Akrotiri and Dhekelia
  • ZZ 11 & GE CX: Bermuda (according to this document)
  • KY1-1111: Cayman Islands (according to this document)
  • VG1111: British Virgin Islands (according to this document)
  • MSR 1111: Montserrat (according to this document)

An all-encompassing regex to match only the British Overseas Territories might look like this:

See regex in use here.

^((ASCN|STHL|TDCU|BBND|[BFS]IQQ|GX\d{2}|PCRN|TKCA) ?\d[A-Z]{2}|(KY\d|MSR|VG|AI)[ -]?\d{4}|(BFPO|[A-Z]{2}) ?\d{2}|GE ?CX)$

3.2 British Forces Post Office

Although they've been recently changed it to better align with the British postcode system to BF# (where # represents a number), they're considered optional alternative postcodes. These postcodes follow(ed) the format of BFPO, followed by 1-4 digits:

See regex in use here

^BFPO ?\d{1,4}$

3.3 Santa?

There's another special case with Santa (as mentioned in other answers): SAN TA1 is a valid postcode. A regex for this is very simply:

^SAN ?TA1$

It looks like we're going to be using ^(GIR ?0AA|[A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]([0-9ABEHMNPRV-Y])?)|[0-9][A-HJKPS-UW]) ?[0-9][ABD-HJLNP-UW-Z]{2})$, which is a slightly modified version of that sugested by Minglis above.

However, we're going to have to investigate exactly what the rules are, as the various solutions listed above appear to apply different rules as to which letters are allowed.

After some research, we've found some more information. Apparently a page on 'govtalk.gov.uk' points you to a postcode specification govtalk-postcodes. This points to an XML schema at XML Schema which provides a 'pseudo regex' statement of the postcode rules.

We've taken that and worked on it a little to give us the following expression:

^((GIR &0AA)|((([A-PR-UWYZ][A-HK-Y]?[0-9][0-9]?)|(([A-PR-UWYZ][0-9][A-HJKSTUW])|([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRV-Y]))) &[0-9][ABD-HJLNP-UW-Z]{2}))$

This makes spaces optional, but does limit you to one space (replace the '&' with '{0,} for unlimited spaces). It assumes all text must be upper-case.

If you want to allow lower case, with any number of spaces, use:

^(([gG][iI][rR] {0,}0[aA]{2})|((([a-pr-uwyzA-PR-UWYZ][a-hk-yA-HK-Y]?[0-9][0-9]?)|(([a-pr-uwyzA-PR-UWYZ][0-9][a-hjkstuwA-HJKSTUW])|([a-pr-uwyzA-PR-UWYZ][a-hk-yA-HK-Y][0-9][abehmnprv-yABEHMNPRV-Y]))) {0,}[0-9][abd-hjlnp-uw-zABD-HJLNP-UW-Z]{2}))$

This doesn't cover overseas territories and only enforces the format, NOT the existence of different areas. It is based on the following rules:

Can accept the following formats:

  • “GIR 0AA”
  • A9 9ZZ
  • A99 9ZZ
  • AB9 9ZZ
  • AB99 9ZZ
  • A9C 9ZZ
  • AD9E 9ZZ

Where:

  • 9 can be any single digit number.
  • A can be any letter except for Q, V or X.
  • B can be any letter except for I, J or Z.
  • C can be any letter except for I, L, M, N, O, P, Q, R, V, X, Y or Z.
  • D can be any letter except for I, J or Z.
  • E can be any of A, B, E, H, M, N, P, R, V, W, X or Y.
  • Z can be any letter except for C, I, K, M, O or V.

Best wishes

Colin


There is no such thing as a comprehensive UK postcode regular expression that is capable of validating a postcode. You can check that a postcode is in the correct format using a regular expression; not that it actually exists.

Postcodes are arbitrarily complex and constantly changing. For instance, the outcode W1 does not, and may never, have every number between 1 and 99, for every postcode area.

You can't expect what is there currently to be true forever. As an example, in 1990, the Post Office decided that Aberdeen was getting a bit crowded. They added a 0 to the end of AB1-5 making it AB10-50 and then created a number of postcodes in between these.

Whenever a new street is build a new postcode is created. It's part of the process for obtaining permission to build; local authorities are obliged to keep this updated with the Post Office (not that they all do).

Furthermore, as noted by a number of other users, there's the special postcodes such as Girobank, GIR 0AA, and the one for letters to Santa, SAN TA1 - you probably don't want to post anything there but it doesn't appear to be covered by any other answer.

Then, there's the BFPO postcodes, which are now changing to a more standard format. Both formats are going to be valid. Lastly, there's the overseas territories source Wikipedia.

+----------+----------------------------------------------+
| Postcode |                   Location                   |
+----------+----------------------------------------------+
| AI-2640  | Anguilla                                     |
| ASCN 1ZZ | Ascension Island                             |
| STHL 1ZZ | Saint Helena                                 |
| TDCU 1ZZ | Tristan da Cunha                             |
| BBND 1ZZ | British Indian Ocean Territory               |
| BIQQ 1ZZ | British Antarctic Territory                  |
| FIQQ 1ZZ | Falkland Islands                             |
| GX11 1AA | Gibraltar                                    |
| PCRN 1ZZ | Pitcairn Islands                             |
| SIQQ 1ZZ | South Georgia and the South Sandwich Islands |
| TKCA 1ZZ | Turks and Caicos Islands                     |
+----------+----------------------------------------------+

Next, you have to take into account that the UK "exported" its postcode system to many places in the world. Anything that validates a "UK" postcode will also validate the postcodes of a number of other countries.

If you want to validate a UK postcode the safest way to do it is to use a look-up of current postcodes. There are a number of options:

  • Ordnance Survey releases Code-Point Open under an open data licence. It'll be very slightly behind the times but it's free. This will (probably - I can't remember) not include Northern Irish data as the Ordnance Survey has no remit there. Mapping in Northern Ireland is conducted by the Ordnance Survey of Northern Ireland and they have their, separate, paid-for, Pointer product. You could use this and append the few that aren't covered fairly easily.

  • Royal Mail releases the Postcode Address File (PAF), this includes BFPO which I'm not sure Code-Point Open does. It's updated regularly but costs money (and they can be downright mean about it sometimes). PAF includes the full address rather than just postcodes and comes with its own Programmers Guide. The Open Data User Group (ODUG) is currently lobbying to have PAF released for free, here's a description of their position.

  • Lastly, there's AddressBase. This is a collaboration between Ordnance Survey, Local Authorities, Royal Mail and a matching company to create a definitive directory of all information about all UK addresses (they've been fairly successful as well). It's paid-for but if you're working with a Local Authority, government department, or government service it's free for them to use. There's a lot more information than just postcodes included.


^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)$

Regular expression to match valid UK postcodes. In the UK postal system not all letters are used in all positions (the same with vehicle registration plates) and there are various rules to govern this. This regex takes into account those rules. Details of the rules: First half of postcode Valid formats [A-Z][A-Z][0-9][A-Z] [A-Z][A-Z][0-9][0-9] [A-Z][0-9][0-9] [A-Z][A-Z][0-9] [A-Z][A-Z][A-Z] [A-Z][0-9][A-Z] [A-Z][0-9] Exceptions Position - First. Contraint - QVX not used Position - Second. Contraint - IJZ not used except in GIR 0AA Position - Third. Constraint - AEHMNPRTVXY only used Position - Forth. Contraint - ABEHMNPRVWXY Second half of postcode Valid formats [0-9][A-Z][A-Z] Exceptions Position - Second and Third. Contraint - CIKMOV not used

http://regexlib.com/REDetails.aspx?regexp_id=260


I had a look into some of the answers above and I'd recommend against using the pattern from @Dan's answer (c. Dec 15 '10), since it incorrectly flags almost 0.4% of valid postcodes as invalid, while the others do not.

Ordnance Survey provide service called Code Point Open which:

contains a list of all the current postcode units in Great Britain

I ran each of the regexs above against the full list of postcodes (Jul 6 '13) from this data using grep:

cat CSV/*.csv |
    # Strip leading quotes
    sed -e 's/^"//g' |
    # Strip trailing quote and everything after it
    sed -e 's/".*//g' |
    # Strip any spaces
    sed -E -e 's/ +//g' |
    # Find any lines that do not match the expression
    grep --invert-match --perl-regexp "$pattern"

There are 1,686,202 postcodes total.

The following are the numbers of valid postcodes that do not match each $pattern:

'^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]?[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)$'
# => 6016 (0.36%)
'^(GIR ?0AA|[A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]([0-9ABEHMNPRV-Y])?)|[0-9][A-HJKPS-UW]) ?[0-9][ABD-HJLNP-UW-Z]{2})$'
# => 0
'^GIR[ ]?0AA|((AB|AL|B|BA|BB|BD|BH|BL|BN|BR|BS|BT|BX|CA|CB|CF|CH|CM|CO|CR|CT|CV|CW|DA|DD|DE|DG|DH|DL|DN|DT|DY|E|EC|EH|EN|EX|FK|FY|G|GL|GY|GU|HA|HD|HG|HP|HR|HS|HU|HX|IG|IM|IP|IV|JE|KA|KT|KW|KY|L|LA|LD|LE|LL|LN|LS|LU|M|ME|MK|ML|N|NE|NG|NN|NP|NR|NW|OL|OX|PA|PE|PH|PL|PO|PR|RG|RH|RM|S|SA|SE|SG|SK|SL|SM|SN|SO|SP|SR|SS|ST|SW|SY|TA|TD|TF|TN|TQ|TR|TS|TW|UB|W|WA|WC|WD|WF|WN|WR|WS|WV|YO|ZE)(\d[\dA-Z]?[ ]?\d[ABD-HJLN-UW-Z]{2}))|BFPO[ ]?\d{1,4}$'
# => 0

Of course, these results only deal with valid postcodes that are incorrectly flagged as invalid. So:

'^.*$'
# => 0

I'm saying nothing about which pattern is the best regarding filtering out invalid postcodes.


According to this Wikipedia table

enter image description here

This pattern cover all the cases

(?:[A-Za-z]\d ?\d[A-Za-z]{2})|(?:[A-Za-z][A-Za-z\d]\d ?\d[A-Za-z]{2})|(?:[A-Za-z]{2}\d{2} ?\d[A-Za-z]{2})|(?:[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]{2})|(?:[A-Za-z]{2}\d[A-Za-z] ?\d[A-Za-z]{2})

When using it on Android\Java use \\d