
Regex for splitting a German address into its parts

Good evening,

I'm trying to split a German address string into its parts in Java. Does anyone know a regex or a library that can do this? It should split addresses like the following:

Name der Straße 25a 88489 Teststadt
to
Name der Straße|25a|88489|Teststadt

or

Teststr. 3 88489 Beispielort (Großer Kreis)
to
Teststr.|3|88489|Beispielort (Großer Kreis)

Ideally, the regex would still work when parts such as the zip code or the city are missing.

Is there any regex or library out there with which I could achieve this?

EDIT: Rules for German addresses:
Street: Characters, numbers and spaces
House no: Number and any characters (or space) until a series of numbers (zip) (at least in these examples)
Zip: 5 digits
Place or City: The rest maybe also with spaces, commas or braces

Christian Kolb asked Mar 25 '12 20:03


4 Answers

I came across a similar problem and tweaked the solutions provided here a little. The result also works but is (imo) a bit simpler to understand and to extend:

/^([a-zäöüß\s\d.,-]+?)\s*([\d\s]+(?:\s?[-|+/]\s?\d+)?\s*[a-z]?)?\s*(\d{5})\s*(.+)?$/i

Here are some example matches.

It can also handle missing street numbers and is easily extensible by adding special characters to the character classes.

([a-zäöüß\s\d.,-]+?)                        # Street name (lazy)
([\d\s]+(?:\s?[-|+/]\s?\d+)?\s*[a-z]?)?     # Street number (optional)

After that comes the zip code, which is the only mandatory part because it is the only part with a fixed format. Everything after the zip code is treated as the city name.
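
For reference, a minimal Java sketch of how this expression might be used. The class name `FpAddressDemo` and the `split` helper are made up for illustration, the `/i` flag is assumed to map to `Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE`, and the umlauts are written as Unicode escapes to sidestep source-encoding issues. Because `[\d\s]+` can swallow a trailing space, each group is trimmed:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FpAddressDemo {
    // Same expression as in the answer, with the non-ASCII letters escaped.
    static final Pattern ADDRESS = Pattern.compile(
        "^([a-z\u00E4\u00F6\u00FC\u00DF\\s\\d.,-]+?)\\s*"
            + "([\\d\\s]+(?:\\s?[-|+/]\\s?\\d+)?\\s*[a-z]?)?\\s*"
            + "(\\d{5})\\s*(.+)?$",
        Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    /** Returns {street, number, zip, city}, or null if the address does not match. */
    static String[] split(String address) {
        Matcher m = ADDRESS.matcher(address);
        if (!m.matches()) {
            return null;
        }
        String[] parts = new String[4];
        for (int i = 0; i < 4; i++) {
            String g = m.group(i + 1);
            // Optional groups that did not take part in the match are null.
            parts[i] = (g == null) ? "" : g.trim();
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] parts = split("Name der Stra\u00DFe 25a 88489 Teststadt");
        System.out.println(String.join("|", parts));
    }
}
```

An address without a street number should come back with an empty second element, since the whole number group is optional.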

F.P answered Oct 03 '22 08:10


I’d start from the back since, as far as I know, a city name cannot contain numbers (but it can contain spaces; the first example I’ve found is “Weil der Stadt”). The five-digit number before that must then be the zip code.

The number (possibly followed by a single letter) before that is the street number. Note that this can also be a range. Anything before that is the street name.

Anyway, here we go:

^((?:\p{L}| |\d|\.|-)+?) (\d+(?: ?- ?\d+)? *[a-zA-Z]?) (\d{5}) ((?:\p{L}| |-)+)(?: *\(([^\)]+)\))?$

This correctly parses even arcane addresses such as “Straße des 17. Juni 23-25 a 12345 Berlin-Mitte”.

Note that this doesn’t work with address extensions (such as “Gartenhaus” or “c/o …”). I have no clue how to handle those. I rather doubt that there’s a viable regular expression to express all this.

As you can see, this is quite a complex regular expression with lots of capture groups. If I were to use such an expression in code, I would use named captures (Java 7 supports them) and break the expression up into smaller morsels using the x flag. Java does offer this flag as Pattern.COMMENTS, but Java 7 has no raw or multi-line string literals, so the pattern still has to be pieced together from escaped string fragments. This s*cks because it makes complex regular expressions hard to read and maintain.

Still, here’s a somewhat more legible regular expression:

^
(?<street>(?:\p{L}|\ |\d|\.|-)+?)\ 
(?<number>\d+(?:\ ?-\ ?\d+)?\ *[a-zA-Z]?)\ 
(?<zip>\d{5})\ 
(?<city>(?:\p{L}|\ |-)+)
(?:\ *\((?<suffix>[^\)]+)\))?
$
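
As a side note, `java.util.regex.Pattern` does provide a free-spacing mode through `Pattern.COMMENTS` (inline `(?x)`), so a layout close to the one above can be kept if each line is written as its own string fragment. A minimal sketch, where the class name `FreeSpacingAddressDemo` and the `match` helper are invented for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FreeSpacingAddressDemo {
    // With COMMENTS, unescaped whitespace and #-comments inside the pattern are
    // ignored, so every literal space has to be written as "\\ ".
    static final Pattern ADDRESS = Pattern.compile(
        "^\n"
            + "(?<street>(?:\\p{L}|\\ |\\d|\\.|-)+?)\\         # street name (lazy)\n"
            + "(?<number>\\d+(?:\\ ?-\\ ?\\d+)?\\ *[a-zA-Z]?)\\   # number or range, optional letter\n"
            + "(?<zip>\\d{5})\\                                # five-digit zip code\n"
            + "(?<city>(?:\\p{L}|\\ |-)+)                      # city name\n"
            + "(?:\\ *\\((?<suffix>[^\\)]+)\\))?               # optional suffix in parentheses\n"
            + "$",
        Pattern.COMMENTS);

    /** Returns a matcher holding the match, or null if the address does not match. */
    static Matcher match(String address) {
        Matcher m = ADDRESS.matcher(address);
        return m.matches() ? m : null;
    }

    public static void main(String[] args) {
        Matcher m = match("Teststr. 3 88489 Beispielort (Gro\u00DFer Kreis)");
        if (m != null) {
            System.out.println(m.group("zip") + " / " + m.group("suffix"));
        }
    }
}
```

The trailing `\n` in each fragment matters: a `#` comment runs to the end of the line, so without it the comment would swallow the rest of the pattern.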

In Java 7, the closest we can achieve is this (untested; may contain typos):

String pattern =
    "^" +
    "(?<street>(?:\\p{L}| |\\d|\\.|-)+?) " +
    "(?<number>\\d+(?: ?- ?\\d+)? *[a-zA-Z]?) " +
    "(?<zip>\\d{5}) " +
    "(?<city>(?:\\p{L}| |-)+)" +
    "(?: *\\((?<suffix>[^\\)]+)\\))?" +
    "$";
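
A hedged usage sketch for this pattern: `Matcher.group(String)` (available since Java 7) reads the named captures. The class name `NamedGroupAddressDemo`, the `parse` helper, and the choice to trim values and skip unmatched groups are illustrative assumptions, not part of the answer itself:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupAddressDemo {
    static final Pattern ADDRESS = Pattern.compile(
        "^"
            + "(?<street>(?:\\p{L}| |\\d|\\.|-)+?) "
            + "(?<number>\\d+(?: ?- ?\\d+)? *[a-zA-Z]?) "
            + "(?<zip>\\d{5}) "
            + "(?<city>(?:\\p{L}| |-)+)"
            + "(?: *\\((?<suffix>[^\\)]+)\\))?"
            + "$");

    static final String[] GROUPS = {"street", "number", "zip", "city", "suffix"};

    /** Returns the matched parts by group name; the map is empty if nothing matched. */
    static Map<String, String> parse(String address) {
        Map<String, String> parts = new LinkedHashMap<>();
        Matcher m = ADDRESS.matcher(address);
        if (m.matches()) {
            for (String name : GROUPS) {
                String value = m.group(name);
                if (value != null) {
                    // The greedy city group may keep a space before a suffix.
                    parts.put(name, value.trim());
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(parse("Stra\u00DFe des 17. Juni 23-25 a 12345 Berlin-Mitte"));
    }
}
```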
Konrad Rudolph answered Oct 03 '22 06:10


Here is my suggestion, which could be fine-tuned further, e.g. to allow missing parts.

Regex Pattern:

^([^0-9]+) ([0-9]+.*?) ([0-9]{5}) (.*)$
  • Group 1: Street
  • Group 2: House no.
  • Group 3: ZIP
  • Group 4: City
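
One possible way to fine-tune the pattern so that zip code and city may be missing, sketched in Java. The optional non-capturing groups are an assumption of this sketch rather than part of the pattern above, the names `LenientAddressDemo` and `split` are made up, and a missing part comes back as `null`:

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LenientAddressDemo {
    // Groups: 1 street, 2 house no., 3 zip (optional), 4 city (optional).
    static final Pattern ADDRESS = Pattern.compile(
        "^([^0-9]+) ([0-9]+.*?)(?: ([0-9]{5}))?(?: (.*))?$");

    /** Returns {street, houseNo, zip, city}; zip and city may be null. */
    static String[] split(String address) {
        Matcher m = ADDRESS.matcher(address);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3), m.group(4)};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("Teststr. 3 88489 Beispielort")));
        System.out.println(Arrays.toString(split("Teststr. 3")));
    }
}
```

Like the original pattern, this assumes the street name contains no digits, so “Straße des 17. Juni” would be split in the wrong place. Without a zip code, any text after the house number is read as the city.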
Michael Schmeißer answered Oct 03 '22 08:10


import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AddressSplitter {
    public static void main(String[] args) {
        String data = "Name der Strase 25a 88489 Teststadt";
        // Groups: street, house number, zip code, city (ASCII letters only)
        String regexp = "([ a-zA-Z]+) (\\w+) (\\d+) ([a-zA-Z]+)";

        Pattern pattern = Pattern.compile(regexp);
        Matcher matcher = pattern.matcher(data);

        if (matcher.find()) {
            // Group 0 is the whole match; groups 1 to 4 are the address parts
            for (int i = 0; i <= matcher.groupCount(); i++) {
                System.out.println(matcher.group(i));
            }
        } else {
            System.out.println("nothing found");
        }
    }
}

I guess it doesn't work with German umlauts, but you can fix that yourself. Anyway, it's a good starting point.

I recommend visiting this; it's a great site about regular expressions. Good luck!

Radek Busz answered Oct 03 '22 08:10