I'm parsing addresses and need to get the address and the country in separate matches, but the countries might have aliases, e.g.:
UK == United Kingdom,
US == USA == United States,
Korea == South Korea,
and so on...
So, what I do is create a big regex with all possible country names (at least the ones more likely to appear) separated by the OR operator, like this:
germany|us|france|chile
But the problem is with multi-word country names and their shorter versions, like:
Republic of Moldova
and Moldova
Using this as example, we have the string:
'Somewhere in Moldova, bla bla, 12313, Republic of Moldova'
What I want to get from this:
'Somewhere in Moldova, bla bla, 12313'
'Republic of Moldova'
But this is what I get:
'Somewhere in Moldova, bla bla, 12313, Republic of'
'Moldova'
As there are several cases, here is what I'm using so far:
^(.*),? \(?(republic of moldova|moldova)\)?(.*[\d\-]+.*|,.*[:/].*)?$
As we might have fax, phone, zip codes or something else after the country name - which I don't care about - I use the last matching group to remove them:
(.*[\d\-]+.*|,.*[:/].*)?
Also, sometimes the country name comes enclosed in parentheses, so I have \(?
and \)?
around the second match group, and all the countries go inside it:
(republic of moldova|moldova|...)
The thing is, when one entry is a subset of a longer one, the shorter is chosen over the longer, and the remainder stays in the base_address string. Is there a way to tell the regex to choose the longest possible match when two values match?
As suggested by m.buettner, changing the first matching group from (.*)
to (.*?)
indeed fixes the current issue, but it also creates another. Consider another example:
'Department of Chemistry, National University of Singapore, 4512436 Singapore'
Matches:
'Department of Chemistry, National University of'
'Singapore'
Here it matches too soon now.
Your problem is greediness.
The .*
right at the beginning tries to match as much as possible. That is everything until the end of the string. But then the rest of your pattern fails. So the engine backtracks, and discards the last character matched with .*
and tries the rest of the pattern again (which still fails). The engine will repeat this process (fail match, backtrack/discard one character, try again) until it can finally match with the rest of the pattern. The first time this occurs is when .*
matches everything up to Moldova
(so .*
is still consuming Republic of
). And then the alternation (which still cannot match republic of moldova
) will gladly match moldova
and return that as the result.
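To see the backtracking result concretely, here is a minimal sketch that runs the pattern from the question against the example string. Python's re module is an assumption (the question does not name a language), and the alternation is cut down to the two Moldova spellings.

```python
import re

# Pattern from the question, with only the two Moldova spellings in the alternation.
GREEDY = re.compile(
    r"^(.*),? \(?(republic of moldova|moldova)\)?(.*[\d\-]+.*|,.*[:/].*)?$",
    re.IGNORECASE,
)

address = "Somewhere in Moldova, bla bla, 12313, Republic of Moldova"
match = GREEDY.match(address)

# The greedy (.*) gives back as little as it can, so it still holds
# "Republic of" and only the short alias is left for the second group.
print(repr(match.group(1)))  # 'Somewhere in Moldova, bla bla, 12313, Republic of'
print(repr(match.group(2)))  # 'Moldova'
```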
The simplest solution is to make the repetition ungreedy:
^(.*?)...
Note that the question mark right after a quantifier does not mean "optional", but makes it "ungreedy". This simply reverses the behaviour: the engine first tries to leave out the .*
completely, and in the process of backtracking it includes one more character after every failed attempt to match the rest of the pattern.
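A small sketch of the ungreedy variant, again assuming Python's re module. The helper name lazy_pattern, the constant TAIL and the address "Some Street 5, Republic of Moldova" are made up for illustration; the Singapore string is the one from the question.

```python
import re

# Trailing group from the question (fax, phone, zip code and similar).
TAIL = r"(.*[\d\-]+.*|,.*[:/].*)?$"

def lazy_pattern(countries):
    """Build the question's pattern with an ungreedy first group (hypothetical helper)."""
    alternation = "|".join(re.escape(c) for c in countries)
    return re.compile(r"^(.*?),? \(?(" + alternation + r")\)?" + TAIL, re.IGNORECASE)

# The first group now grows one character at a time, so the long name is found first.
m = lazy_pattern(["republic of moldova", "moldova"]).match("Some Street 5, Republic of Moldova")
print(repr(m.group(1)))  # 'Some Street 5'
print(repr(m.group(2)))  # 'Republic of Moldova'

# The same laziness stops at the first country name that appears anywhere in the string.
s = "Department of Chemistry, National University of Singapore, 4512436 Singapore"
m2 = lazy_pattern(["singapore"]).match(s)
print(repr(m2.group(1)))  # 'Department of Chemistry, National University of'
print(repr(m2.group(2)))  # 'Singapore'
```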
EDIT:
There are usually better alternatives to ungreediness. As you stated in a comment, the ungreedy solution brings another problem: countries in earlier parts of the string might be matched. What you can do instead is use lookarounds that ensure that there are no word characters (letters, digits, underscore) before or after the country. That means a country is only matched if it is surrounded by commas or either end of the string:
^(.*),?(?<!\w)[ ][(]?(c|o|u|n|t|r|i|e|s)[)]?(?![ ]*\w)(.*[\d\-]+.*|,.*[:/].*)?$
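Since lookarounds are not actually part of the match, they do not interfere with the rest of your pattern - they simply check a condition at a specific position in the match. The two lookarounds I have added ensure that the space in front of the country is not preceded by a word character, and that the country is not followed by a word character, even after any number of spaces.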
Note that I've wrapped spaces in a character class, as well as the literal parentheses (instead of escaping them). Neither is necessary, but I prefer these readability-wise, so they are just a suggestion.
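Here is a sketch of the lookaround pattern filled in with the two Moldova spellings, assuming Python's re module (which supports the fixed-width lookbehind used here). The names countries, alternation and LOOKAROUND are illustrative only. Because the greedy first group may keep the comma in front of the country, the sketch strips it afterwards.

```python
import re

countries = ["republic of moldova", "moldova"]
# Longest names first, so a long spelling is tried before its shorter alias.
alternation = "|".join(re.escape(c) for c in sorted(countries, key=len, reverse=True))

LOOKAROUND = re.compile(
    r"^(.*),?(?<!\w)[ ][(]?(" + alternation + r")[)]?(?![ ]*\w)(.*[\d\-]+.*|,.*[:/].*)?$",
    re.IGNORECASE,
)

address = "Somewhere in Moldova, bla bla, 12313, Republic of Moldova"
match = LOOKAROUND.match(address)

# The greedy first group can end with the comma, so remove it here.
base_address = match.group(1).rstrip(", ")
country = match.group(2)
print(repr(base_address))  # 'Somewhere in Moldova, bla bla, 12313'
print(repr(country))       # 'Republic of Moldova'
```

One caveat of this sketch: a country that shares its comma-separated part with other text, such as '4512436 Singapore' in the question's second example, is rejected by the lookbehind, so that address produces no match at all rather than a wrong one.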
EDIT 2:
As abarnert mentioned in a comment, how about not using a regex-only solution?
You could split the string on ,
, then trim every result, and check these against your list of countries (possibly using regex). If any component of your address is the same as one of your countries, you can return that. If there are multiple matches, then at least you can detect the ambiguity and deal with it properly.
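A rough sketch of that split-based idea, assuming Python. KNOWN_COUNTRIES and split_country are hypothetical names, the country list is a tiny stand-in for a real one, and the comparison is a plain case-insensitive equality test rather than a regex.

```python
# Hypothetical list of accepted spellings, stored in lower case.
KNOWN_COUNTRIES = {"republic of moldova", "moldova", "uk", "united kingdom"}

def split_country(address):
    """Return (other_parts, country_parts) for a comma-separated address."""
    parts = [part.strip() for part in address.split(",")]
    rest, found = [], []
    for part in parts:
        # Ignore optional surrounding parentheses and letter case for the lookup.
        if part.strip("()").lower() in KNOWN_COUNTRIES:
            found.append(part)
        else:
            rest.append(part)
    return rest, found

rest, found = split_country("Somewhere in Moldova, bla bla, 12313, Republic of Moldova")
print(rest)   # ['Somewhere in Moldova', 'bla bla', '12313']
print(found)  # ['Republic of Moldova']

# More than one hit means the address is ambiguous and needs separate handling.
_, several = split_country("Moldova, 12313, UK")
print(several)  # ['Moldova', 'UK']
```

Only whole components are compared here, so "Somewhere in Moldova" is not mistaken for a country, but a component such as "4512436 Singapore" would need extra handling.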