
Find a US street address in text (preferably using Python regex)

Disclaimer: I have read this thread very carefully: Street Address search in a string - Python or Ruby, along with many other resources.

Nothing works for me so far.

In more detail, here is what I am looking for:

The rules are relaxed, and I am definitely not asking for perfect code that covers all cases; just a few simple, basic ones, assuming the address is in this format:

a) Street number (1...N digits);

b) Street name : one or more words capitalized;

b-2) (optional) ideally the street name could be prefixed with an abbreviation: "S.", "N.", "E.", "W."

c) (optional) unit/apartment/etc can be any (incl. empty) number of arbitrary characters

d) Street "type": one of ("st.", "ave.", "way");

e) City name : 1 or more Capitalized words;

f) (optional) state abbreviation (2 letters)

g) (optional) zip which is any 5 digits.

None of the above needs to be a valid thing (e.g. an existing city or zip).

I am trying expressions like these so far:

pat = re.compile(r'\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?', re.IGNORECASE)

>>> pat.search("123 East Virginia avenue, unit 123, San Ramondo, CA, 94444")

This doesn't work, and it's not easy for me to understand why. Specifically: how do I separate, in my pattern, a group of arbitrary words from one of the specific words that should follow it, like a state abbreviation or a street "type" ("st.", "ave.")?

Anyhow, here is an example of what I am hoping to get. Given:

def ex_addr(text):
    # does the re magic
    # returns 1st address (all addresses?) or None if nothing found
    pass

for t in [
    'The meeting will be held at 22 West Westin st., South Carolina, 12345 on Nov.-18',
    'The meeting will be held at 22 West Westin street, SC, 12345 on Nov.-18',

    'Hi there,\n How about meeting tomorr. @10am-sh in Chadds @ 123 S. Vancouver ave. in Ottawa? \nThanks!!!',
    'Hi there,\n How about meeting tomorr. @10am-sh in Chadds @ 123 S. Vancouver avenue in Ottawa? \nThanks!!!',

    'This was written in 1999 in Montreal',

    "Cool cafe at 420 Funny Lane, Cupertino CA is way too cool",

    "We're at a party at 12321 Mammoth Lane, Lexington MA 77777; Come have a beer!"
]:
    print(ex_addr(t))

I would like to get:

'22 West Westin st., South Carolina, 12345'
'22 West Westin street, SC, 12345'
'123 S. Vancouver ave. in Ottawa'
'123 S. Vancouver avenue in Ottawa'
None # for 'This was written in 1999 in Montreal',
"420 Funny Lane, Cupertino CA",
"12321 Mammoth Lane, Lexington MA 77777"

Could you please help?

bzdjamboo asked Aug 21 '13 21:08



2 Answers

I just ran across this on GitHub while working on a similar problem. It appears to work and to be more robust than your current pattern.

https://github.com/madisonmay/CommonRegex

Looking at the code, its regex for street addresses accounts for many more street types:

'\d{1,4} [\w\s]{1,20}(?:street|st|avenue|ave|road|rd|highway|hwy|square|sq|trail|trl|drive|dr|court|ct|parkway|pkwy|circle|cir|boulevard|blvd)\W?(?=\s|$)'
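As a rough way to try that pattern out, here is a minimal sketch that uses only the standard re module rather than the CommonRegex package itself. The helper name find_street and the re.IGNORECASE flag are my own assumptions, not something stated above.

```python
import re

# Pattern copied from the answer above. Compiling it with re.IGNORECASE is an
# assumption made here so that "Street" and "street" are treated alike.
STREET_ADDRESS = re.compile(
    r'\d{1,4} [\w\s]{1,20}'
    r'(?:street|st|avenue|ave|road|rd|highway|hwy|square|sq|trail|trl|drive|dr'
    r'|court|ct|parkway|pkwy|circle|cir|boulevard|blvd)\W?(?=\s|$)',
    re.IGNORECASE,
)

def find_street(text):
    """Hypothetical helper: return the first street-address match, or None."""
    m = STREET_ADDRESS.search(text)
    return m.group(0) if m else None

# Two of the sample sentences from the question.
print(find_street('The meeting will be held at 22 West Westin street, SC, 12345 on Nov.-18'))
print(find_street('Cool cafe at 420 Funny Lane, Cupertino CA is way too cool'))
```

Note that the pattern only covers the street part (no city, state or zip), and that "lane" is not in its list of street types, so the "Funny Lane" sample from the question is not matched as written.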

ccdpowell answered Oct 13 '22 01:10



\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?

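In this regex, you have one space too many: the literal space after the second comma, just before the second ( \w+){1,5}, duplicates the space that the group itself already begins with, so the pattern demands two spaces before the city name. If you remove it (so that part reads (.*),( \w+){1,5}), the pattern matches your example.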
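To see the effect of that extra space, here is a small sketch comparing the pattern from the question with the same pattern minus that one space. The variable names are just for illustration.

```python
import re

sample = "123 East Virginia avenue, unit 123, San Ramondo, CA, 94444"

# Pattern exactly as posted in the question: ", ( \w+)" needs two spaces
# after the comma, which the sample text does not have.
original = re.compile(
    r'\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?',
    re.IGNORECASE,
)

# Same pattern with the extra space removed: ",( \w+)".
fixed = re.compile(
    r'\d{1,4}( \w+){1,5}, (.*),( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?',
    re.IGNORECASE,
)

print(original.search(sample))         # no match object
print(fixed.search(sample).group(0))   # the whole sample string
```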

I don't think you can assume that a "unit 123" or similar will always be there, and there might be several such parts (e.g. "building A, apt 3"). Note that in your initial regex, the . in (.*) can also match a comma, which could lead to very long (and unwanted) matches. You should probably accept several such groups, with a limit on both their number and their length: e.g. replace , (.*) with something like (, [^,]{1,20}){0,5}.
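Here is a sketch of that replacement, assuming the extra space has already been removed. HEAD, TAIL, greedy and bounded are names made up for this example, and the second test sentence is invented to show the difference; it is not from the question.

```python
import re

# Hypothetical building blocks; the state list is the one from the question.
HEAD = r'\d{1,4}( \w+){1,5}'
TAIL = r',( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?'

greedy = re.compile(HEAD + r', (.*)' + TAIL, re.IGNORECASE)
bounded = re.compile(HEAD + r'(, [^,]{1,20}){0,5}' + TAIL, re.IGNORECASE)

# The bounded version accepts zero, one or several comma-separated extras.
for text in [
    "123 East Virginia avenue, San Ramondo, CA, 94444",
    "123 East Virginia avenue, unit 123, San Ramondo, CA, 94444",
    "123 East Virginia avenue, building A, apt 3, San Ramondo, CA, 94444",
]:
    m = bounded.search(text)
    print(m.group(0) if m else None)

# Invented text with two addresses: (.*) runs on to the second one, while
# the bounded version stops at the end of the first.
two = ("Old office: 123 Main st, unit 1, San Jose, CA, 95000. The new office "
       "opens next month in building B, Big City, CA, 94000")
print(greedy.search(two).group(0))
print(bounded.search(two).group(0))
```

The limits 20 and 5 are arbitrary; they only keep a single match from swallowing long stretches of unrelated text.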

In any case, you will probably never get something 100% accurate that accepts every variation people might throw at it. Do lots of tests! Good luck.
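One cheap way to do those tests is to keep a list of (text, expected) pairs and print whatever a candidate pattern gets wrong. This is only a sketch: first_match and failures are hypothetical helper names, and the pattern is the question's regex with the two changes suggested above.

```python
import re

def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None

def failures(pattern, cases):
    """Return (text, expected, got) for every case the pattern gets wrong."""
    results = [(text, expected, first_match(pattern, text))
               for text, expected in cases]
    return [r for r in results if r[1] != r[2]]

pat = re.compile(
    r'\d{1,4}( \w+){1,5}(, [^,]{1,20}){0,5},( \w+){1,5}, '
    r'(AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?',
    re.IGNORECASE,
)

# Sample texts and expected results taken from the question.
cases = [
    ("123 East Virginia avenue, unit 123, San Ramondo, CA, 94444",
     "123 East Virginia avenue, unit 123, San Ramondo, CA, 94444"),
    ("This was written in 1999 in Montreal", None),
    ("We're at a party at 12321 Mammoth Lane, Lexington MA 77777; Come have a beer!",
     "12321 Mammoth Lane, Lexington MA 77777"),
]

for text, expected, got in failures(pat, cases):
    print("expected %r, got %r" % (expected, got))
```

With these three cases, only the "Mammoth Lane" one is reported: the pattern allows at most four digits in the street number, expects commas around the state, and does not list MA.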

remram answered Oct 12 '22 23:10
