
Regex which matches the longer string in an OR

Tags: python, regex

Motivation

I'm parsing addresses and need to get the address and the country as separate matches, but the countries might have aliases, e.g.:

UK == United Kingdom, 
US == USA == United States,
Korea == South Korea, 

and so on...

Explanation

So, what I do is create a big regex with all possible country names (at least the ones more likely to appear) separated by the OR operator, like this:

germany|us|france|chile

But the problem is with multi-word country names and their shorter versions, like:

Republic of Moldova and Moldova

Using this as example, we have the string:

'Somewhere in Moldova, bla bla, 12313, Republic of Moldova'

What I want to get from this:

'Somewhere in Moldova, bla bla, 12313'
'Republic of Moldova'

But this is what I get:

'Somewhere in Moldova, bla bla, 12313, Republic of'
'Moldova'

Regex

As there are several cases, here is what I'm using so far:

^(.*),? \(?(republic of moldova|moldova)\)?(.*[\d\-]+.*|,.*[:/].*)?$

As we might have fax, phone, zip codes or something else after the country name - which I don't care about - I use the last matching group to remove them:

(.*[\d\-]+.*|,.*[:/].*)?

Also, sometimes the country name comes enclosed in parentheses, so I have \(? and \)? around the second match group, and all the countries go inside it:

(republic of moldova|moldova|...)
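
For illustration, here is a minimal sketch of how such a group could be assembled from a list of names. The COUNTRIES list and the build_alternation helper are hypothetical, not part of the original code. Python's re tries alternatives left to right, so putting longer aliases first only decides which alias wins at a given start position; it does not change where the group starts matching.

```python
import re

# Hypothetical alias list; the real one would be much longer.
COUNTRIES = ["moldova", "republic of moldova", "us", "usa", "united states"]

def build_alternation(names):
    """Join names with '|', longest first, escaping regex metacharacters."""
    ordered = sorted(names, key=len, reverse=True)
    return "|".join(re.escape(name) for name in ordered)

alternation = build_alternation(COUNTRIES)

# Alternation is ordered: the first alternative that matches wins.
print(re.match("us|usa", "usa").group(0))
print(re.match(alternation, "usa").group(0))
```

With the unsorted "us|usa" the shorter alias wins on "usa"; with the sorted alternation the longer one is tried first.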

Question

The thing is, when one entry is a substring of a longer one, the shorter one is chosen over the longer, and the remainder stays in the base_address string. Is there a way to tell the regex to prefer the longest possible match when two alternatives match?

Edit

  1. I'm using Python with the built-in re module.
  2. As suggested by m.buettner, changing the first matching group from (.*) to (.*?) indeed fixes the current issue, but it also creates another. Consider another example:

    'Department of Chemistry, National University of Singapore, 4512436 Singapore'

Matches:

'Department of Chemistry, National University of'
'Singapore'

Here it matches too soon now.

alfetopito asked May 18 '13 00:05



1 Answer

Your problem is greediness.

The .* right at the beginning tries to match as much as possible, that is, everything until the end of the string. But then the rest of your pattern fails. So the engine backtracks: it discards the last character matched by .* and tries the rest of the pattern again (which still fails). The engine repeats this process (fail to match, backtrack by one character, try again) until the rest of the pattern can finally match. The first time this happens is when .* matches everything up to the final Moldova (so .* is still consuming Republic of). At that position the alternation cannot match republic of moldova, so it gladly matches moldova and returns that as the result.
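
The backtracking described above can be reproduced with a short sketch. It uses the pattern from the question reduced to the two Moldova aliases; re.IGNORECASE is an assumption, since the question does not say how case is handled.

```python
import re

# Pattern from the question, reduced to the two Moldova aliases.
# re.IGNORECASE is an assumption about how the original code handles case.
GREEDY = re.compile(
    r"^(.*),? \(?(republic of moldova|moldova)\)?(.*[\d\-]+.*|,.*[:/].*)?$",
    re.IGNORECASE,
)

text = "Somewhere in Moldova, bla bla, 12313, Republic of Moldova"
match = GREEDY.match(text)
base_address, country = match.group(1), match.group(2)

print(repr(base_address))  # 'Somewhere in Moldova, bla bla, 12313, Republic of'
print(repr(country))       # 'Moldova'
```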

The simplest solution is to make the repetition ungreedy:

^(.*?)...

Note that the question mark right after a quantifier does not mean "optional", but makes it "ungreedy". This simply reverses the behaviour: the engine first tries to leave out the .* completely, and in the process of backtracking it includes one more character after every failed attempt to match the rest of the pattern.
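
As a rough sketch of what the ungreedy version does, assuming re.IGNORECASE and a country group reduced to three entries: the first input string is a made-up address, the second is the Singapore example from the question.

```python
import re

# Same pattern as in the question, but with an ungreedy first group.
# The three-entry country group and re.IGNORECASE are assumptions.
LAZY = re.compile(
    r"^(.*?),? \(?(republic of moldova|moldova|singapore)\)?"
    r"(.*[\d\-]+.*|,.*[:/].*)?$",
    re.IGNORECASE,
)

# Hypothetical address: the long alias is now found in full.
simple = "Main Street, Chisinau, Republic of Moldova"
m1 = LAZY.match(simple)
print(m1.group(1), "|", m1.group(2))
# Main Street, Chisinau | Republic of Moldova

# Example from the question: the first "Singapore" wins, which is too early.
singapore = ("Department of Chemistry, National University of Singapore, "
             "4512436 Singapore")
m2 = LAZY.match(singapore)
print(m2.group(1), "|", m2.group(2))
# Department of Chemistry, National University of | Singapore
```

In the second case the optional last group swallows ", 4512436 Singapore" because it contains digits, so the early match is accepted.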

EDIT:

There are usually better alternatives to ungreediness. As you stated in a comment, the ungreedy solution brings another problem: countries in earlier parts of the string might be matched. What you can do instead is use lookarounds that ensure there are no word characters (letters, digits, underscore) before or after the country. That means a country is only matched if it is delimited by commas or by either end of the string:

^(.*),?(?<!\w)[ ][(]?(c|o|u|n|t|r|i|e|s)[)]?(?![ ]*\w)(.*[\d\-]+.*|,.*[:/].*)?$

Since lookarounds are not actually part of the match, they do not interfere with the rest of your pattern - they simply check a condition at a specific position in the match. The two lookarounds I have added ensure that:

  1. There is no word character before the mandatory space preceding the country.
  2. There is no word character after the country that is separated by nothing but spaces.

Note that I've wrapped spaces in a character class, as well as the literal parentheses (instead of escaping them). Neither is necessary, but I prefer these readability-wise, so they are just a suggestion.
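
Here is a minimal sketch of that pattern with the (c|o|u|n|t|r|i|e|s) placeholder replaced by a small, assumed country group, again with re.IGNORECASE as an assumption. Because .* is greedy, the first group keeps the trailing comma, so the sketch strips it afterwards.

```python
import re

# Lookaround pattern from above; the country group and re.IGNORECASE
# are assumptions made for this sketch.
LOOKAROUND = re.compile(
    r"^(.*),?(?<!\w)[ ][(]?(republic of moldova|moldova|singapore)[)]?"
    r"(?![ ]*\w)(.*[\d\-]+.*|,.*[:/].*)?$",
    re.IGNORECASE,
)

moldova = "Somewhere in Moldova, bla bla, 12313, Republic of Moldova"
m = LOOKAROUND.match(moldova)
# The greedy first group keeps the trailing comma, so strip it here.
base_address = m.group(1).rstrip(", ")
country = m.group(2)
print(repr(base_address))  # 'Somewhere in Moldova, bla bla, 12313'
print(repr(country))       # 'Republic of Moldova'

# Both occurrences of "Singapore" directly follow a word character
# ("of" and the zip code), so this pattern finds no country here.
singapore = ("Department of Chemistry, National University of Singapore, "
             "4512436 Singapore")
print(LOOKAROUND.match(singapore))  # None
```

The second result shows the limit of this approach: a country that directly follows a zip code, without a comma in between, is not matched.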

EDIT 2:

As abarnert mentioned in a comment, how about not using a regex-only solution?

You could split the string on commas, then trim every result, and check these against your list of countries (possibly using regex). If any component of your address is the same as one of your countries, you can return that. If there are multiple ones, then at least you can detect the ambiguity and deal with it properly.
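
The split-based idea could be sketched roughly like this. The COUNTRIES set and the split_country helper are hypothetical names, and the check is a plain case-insensitive comparison rather than a regex. Returning None on ambiguity is just one possible policy.

```python
# Hypothetical set of known country names and aliases, lowercased.
COUNTRIES = {"moldova", "republic of moldova", "singapore"}

def split_country(address):
    """Return (base_address, country), or None if no country
    or more than one country component is found."""
    parts = [part.strip() for part in address.split(",")]
    # Compare each trimmed component, without enclosing parentheses.
    hits = [i for i, part in enumerate(parts)
            if part.strip("()").lower() in COUNTRIES]
    if len(hits) != 1:
        return None  # nothing found, or ambiguous
    idx = hits[0]
    # Everything before the country is the base address; anything
    # after it (fax, phone, ...) is dropped.
    return ", ".join(parts[:idx]), parts[idx].strip("()")

print(split_country(
    "Somewhere in Moldova, bla bla, 12313, Republic of Moldova"))
# ('Somewhere in Moldova, bla bla, 12313', 'Republic of Moldova')
```

A component such as '4512436 Singapore' is not an exact country name, so it would need a looser check (for example a regex that allows a leading zip code) before this sketch could handle it.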

Martin Ender answered Oct 07 '22 01:10