Regex which matches the longer string in an OR

Motivation

I'm parsing addresses and need to get the address and the country in separate matches, but the countries might have aliases, e.g.:

UK == United Kingdom, 
US == USA == United States,
Korea == South Korea,

and so on...

Explanation

So, what I do is create a big regex with all possible country names (at least the ones more likely to appear) separated by the OR operator, like this:

germany|us|france|chile

But the problem is with multi-word country names and their shorter versions, like:

Republic of Moldova and Moldova

Using this as example, we have the string:

'Somewhere in Moldova, bla bla, 12313, Republic of Moldova'

What I want to get from this:

'Somewhere in Moldova, bla bla, 12313'
'Republic of Moldova'

But this is what I get:

'Somewhere in Moldova, bla bla, 12313, Republic of'
'Moldova'

Regex

As there are several cases, here is what I'm using so far:

^(.*),? \(?(republic of moldova|moldova)\)?(.*[\d\-]+.*|,.*[:/].*)?$

As we might have fax, phone, zip codes or something else after the country name - which I don't care about - I use the last matching group to remove them:

(.*[\d\-]+.*|,.*[:/].*)?

Also, sometimes the country name comes enclosed in parentheses, so I have \(? and \)? around the second match group, and all the countries go inside it:

(republic of moldova|moldova|...)

Question

The thing is, when one entry is a substring of a longer one, the shorter is chosen over the longer, and the remainder stays in the base_address string. Is there a way to tell the regex to prefer the longest possible match when two alternatives match?

Edit

I'm using Python with the built-in re module.
As suggested by m.buettner, changing the first matching group from (.*) to (.*?) indeed fixes the current issue, but it also creates another. Consider another example:

'Department of Chemistry, National University of Singapore, 4512436 Singapore'

Matches:

'Department of Chemistry, National University of'
'Singapore'

Here it matches too soon now.

asked May 18 '13 00:05

alfetopito

1 Answer

Your problem is greediness.

The .* right at the beginning tries to match as much as possible, that is, everything up to the end of the string. But then the rest of your pattern fails. So the engine backtracks: it discards the last character matched by .* and tries the rest of the pattern again (which still fails). The engine repeats this process (fail to match, backtrack by one character, try again) until the rest of the pattern can finally match. The first time this occurs is when .* has matched everything up to the final Moldova (so .* is still consuming Republic of). At that position the alternation cannot match republic of moldova, so it gladly matches moldova and returns that as the result.
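
The backtracking described above can be reproduced with Python's re module. This is a minimal sketch: the pattern and the address come from the question, while the re.IGNORECASE flag is an assumption (the question does not say how case is handled).

```python
import re

address = "Somewhere in Moldova, bla bla, 12313, Republic of Moldova"

# Pattern from the question; IGNORECASE is assumed so that the lower-case
# country names in the pattern match the capitalised ones in the address.
greedy = re.compile(
    r"^(.*),? \(?(republic of moldova|moldova)\)?(.*[\d\-]+.*|,.*[:/].*)?$",
    re.IGNORECASE,
)

m = greedy.match(address)
# The greedy (.*) gives back only as much as it must, so it keeps
# "Republic of" and leaves just "Moldova" for the alternation.
print(m.group(1))  # Somewhere in Moldova, bla bla, 12313, Republic of
print(m.group(2))  # Moldova
```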

The simplest solution is to make the repetition ungreedy:

^(.*?)...

Note that the question mark right after a quantifier does not mean "optional", but makes it "ungreedy". This simply reverses the behaviour: the engine first tries to leave out the .* completely, and in the process of backtracking it includes one more character after every failed attempt to match the rest of the pattern.
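
A small sketch of the ungreedy variant, again assuming re.IGNORECASE and an alternation that also lists singapore. The first address is a shortened, hypothetical one in which the country appears only once; the second is the Singapore example from the question, where the lazy group stops at the first occurrence of the country name.

```python
import re

# Same pattern as in the question, but with a lazy first group (.*?).
lazy = re.compile(
    r"^(.*?),? \(?(republic of moldova|moldova|singapore)\)?"
    r"(.*[\d\-]+.*|,.*[:/].*)?$",
    re.IGNORECASE,
)

# Hypothetical address with a single country mention.
simple = "Some street, 12313, Republic of Moldova"
lazy_simple = lazy.match(simple)
print(lazy_simple.group(1))  # Some street, 12313
print(lazy_simple.group(2))  # Republic of Moldova

# Address from the question: the lazy group grows one character at a time,
# so the first "Singapore" wins and the third group swallows the rest.
singapore = ("Department of Chemistry, National University of Singapore, "
             "4512436 Singapore")
lazy_sg = lazy.match(singapore)
print(lazy_sg.group(1))  # Department of Chemistry, National University of
print(lazy_sg.group(2))  # Singapore
```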

EDIT:

There are usually better alternatives to ungreediness. As you stated in a comment, the ungreedy solution introduces another problem: countries in earlier parts of the string might be matched. What you can do instead is use lookarounds that ensure there are no word characters (letters, digits, underscore) before or after the country. That means a country name is only matched if it is surrounded by commas or either end of the string:

^(.*),?(?<!\w)[ ][(]?(c|o|u|n|t|r|i|e|s)[)]?(?![ ]*\w)(.*[\d\-]+.*|,.*[:/].*)?$

Since lookarounds are not actually part of the match, they do not interfere with the rest of your pattern - they simply check a condition at a specific position in the match. The two lookarounds I have added ensure that:

There is no word character before the mandatory space preceding the country.
There is no word character after the country that is separated by nothing but spaces.

Note that I've wrapped spaces in a character class, as well as the literal parentheses (instead of escaping them). Neither is necessary, but I prefer these readability-wise, so they are just a suggestion.
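
To see how the lookaround pattern behaves, here is a hedged sketch in Python. The (c|o|u|n|t|r|i|e|s) placeholder is filled with a few sample names, and re.IGNORECASE is an assumption; both addresses are the ones from the question.

```python
import re

# Lookaround pattern from above with a small sample country list.
lookaround = re.compile(
    r"^(.*),?(?<!\w)[ ][(]?(republic of moldova|moldova|singapore)[)]?"
    r"(?![ ]*\w)(.*[\d\-]+.*|,.*[:/].*)?$",
    re.IGNORECASE,
)

moldova = "Somewhere in Moldova, bla bla, 12313, Republic of Moldova"
hit = lookaround.match(moldova)
# The greedy (.*) also swallows the comma before the country, so trim it.
print(hit.group(1).rstrip(", "))  # Somewhere in Moldova, bla bla, 12313
print(hit.group(2))               # Republic of Moldova

singapore = ("Department of Chemistry, National University of Singapore, "
             "4512436 Singapore")
# No space in this address is both preceded by a non-word character and
# followed by a country name, so nothing matches.
print(lookaround.match(singapore))  # None
```

On the Singapore address this pattern finds no match at all, because the zip code sits directly before the country without a comma. That is at least detectable, unlike a wrong split.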

EDIT 2:

As abarnert mentioned in a comment, how about not using a regex-only solution?

You could split the string on ,, then trim every result, and check these against your list of countries (possibly using regex). If any component of your address is the same as one of your countries, you can return that. If there are multiple ones, then at least you can detect the ambiguity and deal with it properly.
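
The split-and-check idea could look roughly like this. It is only a sketch: COUNTRY_ALIASES, NOISE and find_countries are hypothetical names, the alias table is a tiny sample, and trimming digits, hyphens and parentheses from each component is an assumption about what counts as noise around a country name.

```python
import re

# Hypothetical alias table: lower-cased spelling -> canonical name.
COUNTRY_ALIASES = {
    "republic of moldova": "Moldova",
    "moldova": "Moldova",
    "singapore": "Singapore",
    "uk": "United Kingdom",
    "united kingdom": "United Kingdom",
}

# Assumption: digits, hyphens, spaces and parentheses at either end of a
# component (zip codes, "(Moldova)") can be trimmed before the lookup.
NOISE = re.compile(r"^[\d\-\s()]+|[\d\-\s()]+$")


def find_countries(address):
    """Return (component_index, canonical_country) for every
    comma-separated component that is exactly a known country."""
    hits = []
    for index, part in enumerate(address.split(",")):
        cleaned = NOISE.sub("", part).lower()
        if cleaned in COUNTRY_ALIASES:
            hits.append((index, COUNTRY_ALIASES[cleaned]))
    return hits


print(find_countries(
    "Somewhere in Moldova, bla bla, 12313, Republic of Moldova"))
# [(3, 'Moldova')]
print(find_countries(
    "Department of Chemistry, National University of Singapore, "
    "4512436 Singapore"))
# [(2, 'Singapore')]
```

A result with more than one entry signals an ambiguous address, and an empty list means no component was recognised, so both cases can be handled explicitly by the caller.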

answered Oct 07 '22 01:10

Martin Ender


Tags:

python

regex