Google Maps results are often displayed thus:
'\n113 W 5th St\nEureka, MO, United States\n(636) 938-9310\n'
Another variation:
'Clayton Village Shopping Center, 14856 Clayton Rd\nChesterfield, MO, United States\n(636) 227-2844'
And another:
'Wildwood, MO\nUnited States\n(636) 458-7707'
Notice the variation in the placement of the \n characters.
I'm looking to extract the first X lines as the address, and the last line as the phone number. A regex such as (.*\n.*)\n(.*) would suffice for the first example, but falls short for the other two. The only thing I can rely on is that the phone number will be in the form (ddd) ddd-dddd.
I think a regex that allows for every possible variation will be hard to come by. Is it possible to use split(), but keep the character we split by? In this example, I would split by "(" to separate the address from the phone number, but retain that character in the phone number. I could concatenate the "(" back onto split("(")[1], but is there a neater way?
Don't use a regex for the whole string. Just split the string on '\n'. The last element is the phone number, the other elements are the address.
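import re
lines = inputString.strip().split('\n')  # strip() drops leading/trailing newlines
# REGEX_PHONE_US is a phone pattern you define, e.g. r'\(\d{3}\) \d{3}-\d{4}'
phone = lines[-1] if re.match(REGEX_PHONE_US, lines[-1]) else None
address = '\n'.join(lines[:-1]) if phone else inputString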
Python has a lot of great built-in tools for manipulating strings in a more... human way... than regex allows.
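The split-on-newline idea needs a concrete phone pattern and has to cope with the leading and trailing \n in the first example. Here is a minimal runnable sketch of that approach, assuming the phone number always has the (ddd) ddd-dddd form described in the question. The pattern and the helper name split_listing are illustrative assumptions, not part of the original answer.

```python
import re

# Assumed pattern for the "(ddd) ddd-dddd" form described in the question.
REGEX_PHONE_US = re.compile(r'\(\d{3}\) \d{3}-\d{4}')

def split_listing(text):
    """Return (address, phone); phone is None if the last line is not a phone number."""
    # strip() removes the leading/trailing newlines seen in the first example.
    lines = text.strip().split('\n')
    last = lines[-1].strip()
    if REGEX_PHONE_US.fullmatch(last):
        return '\n'.join(lines[:-1]), last
    # No phone number found: treat the whole text as the address.
    return '\n'.join(lines), None

print(split_listing('\n113 W 5th St\nEureka, MO, United States\n(636) 938-9310\n'))
```

Using fullmatch rather than match means a last line that merely starts with something phone-like is not accepted as a phone number.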
If I understand you correctly, you want to "extract the first X lines as address". Assuming that all the addresses you need are in the US this regex code should work for you. In any case, it works on the 3 examples you provided:
import re
x = 'Wildwood, MO\nUnited States\n(636) 458-7707'
print(re.findall(r'.*\n+.*States', x))
The output is:
['Wildwood, MO\nUnited States']
If you want to print the match itself rather than the list, so that the \n is rendered as a line break, you can do it this way:
x = '\n113 W 5th St\nEureka, MO, United States\n(636) 938-9310\n'
y = re.findall(r'.*\n+.*States', x)
y = y[0].rstrip()
When you print y, the output is:
113 W 5th St
Eureka, MO, United States
And, if you want to extract the phone number separately you can do this:
tel = '\n113 W 5th St\nEureka, MO, United States\n(636) 938-9310\n'
num = re.findall(r'.*\d+\-\d+', tel)
num = num[0].rstrip()
When you print num, the output is:
(636) 938-9310
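The address and the phone number can also be pulled apart in one pass by splitting on the phone pattern itself. When the pattern given to re.split is wrapped in a capturing group, the matched text is kept in the result list, which is one way to "split but keep what we split by". This is a rough sketch, assuming the phone number always has the (ddd) ddd-dddd form and does not depend on the address ending in "States". The name extract is a hypothetical helper.

```python
import re

# The capturing group makes re.split keep the matched phone number in the result.
PHONE = re.compile(r'(\(\d{3}\) \d{3}-\d{4})')

def extract(text):
    """Return (address, phone); phone is None when no phone number is present."""
    parts = PHONE.split(text, maxsplit=1)
    if len(parts) == 1:
        # Pattern not found: nothing was split off.
        return text.strip(), None
    # parts is [text before the phone, the phone itself, text after the phone].
    return parts[0].strip(), parts[1]

address, phone = extract('\n113 W 5th St\nEureka, MO, United States\n(636) 938-9310\n')
print(address)
print(phone)
```

Because the split is anchored on the phone number rather than on a fixed count of lines, the number of address lines and any stray \n characters do not matter.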