
How to split by a character in Python yet keep that character?

Google Maps results are often displayed thus:


'\n113 W 5th St\nEureka, MO, United States\n(636) 938-9310\n'

Another variation:


'Clayton Village Shopping Center, 14856 Clayton Rd\nChesterfield, MO, United States\n(636) 227-2844'

And another:


'Wildwood, MO\nUnited States\n(636) 458-7707'

Notice the variation in the placement of the \n characters.

I'm looking to extract the first X lines as address, and the last line as phone number. A regex such as (.*\n.*)\n(.*) would suffice for the first example, but falls short for the other two. The only thing I can rely on is that the phone number will be in the form (ddd) ddd-dddd.

I think a regex that will allow for each and every possible variation will be hard to come by. Is it possible to use split(), but maintain the character we have split by? So in this example, split by "(", to split out the address and phone number, but retain this character in the phone number? I could concatenate the "(" back into split("(")[1], but is there a neater way?

Asked by Pyderman, Jun 29 '15 03:06


2 Answers

Don't use regex. Just split the string on '\n'. The last element is the phone number, and the other elements are the address.

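import re
REGEX_PHONE_US = r'\(\d{3}\) \d{3}-\d{4}$'  # assumed pattern for "(ddd) ddd-dddd"
lines   = inputString.strip().split('\n')   # strip() drops a leading or trailing '\n'
phone   = lines[-1] if re.match(REGEX_PHONE_US, lines[-1]) else None
address = '\n'.join(lines[:-1]) if phone else inputString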

Python has a lot of great built-in tools for manipulating strings in a more... human way... than regex allows.
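On the literal question of splitting while keeping the delimiter, here is a minimal sketch of two standard-library options. It assumes the phone number is introduced by the last "(" in the string, and the variable names are only illustrative. str.rpartition returns the separator as its own item, and re.split keeps the delimiter when the pattern is wrapped in a capturing group.

```python
import re

s = 'Wildwood, MO\nUnited States\n(636) 458-7707'

# rpartition splits at the LAST "(" and hands the separator back separately,
# so it can be glued onto the phone part without hard-coding it twice.
before, sep, after = s.rpartition('(')
address = before.strip()
phone = (sep + after).strip()

# re.split includes the delimiter in the result when it is a capturing group.
parts = re.split(r'(\()', s)

print(repr(address))  # 'Wildwood, MO\nUnited States'
print(repr(phone))    # '(636) 458-7707'
print(parts)          # ['Wildwood, MO\nUnited States\n', '(', '636) 458-7707']
```

If the string contains no "(", rpartition returns two empty strings followed by the whole input, so that case would need its own check.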

Answered by ArtOfWarfare, Sep 25 '22 18:09


If I understand you correctly, you want to "extract the first X lines as address". Assuming that all the addresses you need are in the US, this regex should work for you. In any case, it works on the 3 examples you provided:

import re
x = 'Wildwood, MO\nUnited States\n(636) 458-7707'
# one line, the newline(s) after it, then the next line up to the literal 'States'
print(re.findall(r'.*\n+.*States', x))

The output is:

['Wildwood, MO\nUnited States']

If you want to print it later as plain text, without the \n escape showing, you can do it this way:

x = '\n113 W 5th St\nEureka, MO, United States\n(636) 938-9310\n'
y = re.findall(r'.*\n+.*States', x)
y = y[0].rstrip()

When you print y, the output is:

113 W 5th St
Eureka, MO, United States

And, if you want to extract the phone number separately you can do this:

tel = '\n113 W 5th St\nEureka, MO, United States\n(636) 938-9310\n'
num = re.findall(r'.*\d+\-\d+', tel)
num = num[0].rstrip()

When you print num, the output is:

(636) 938-9310
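
The two extractions above can also be combined into one pass. The following is a sketch only, relying on the single guarantee stated in the question, that the phone number has the form (ddd) ddd-dddd. The names PHONE_RE and split_address_phone are hypothetical and not part of either answer. It searches for the phone pattern and treats everything before the match as the address, so it does not depend on where the \n characters fall or on the address ending in "States".

```python
import re

# Assumed pattern for "(ddd) ddd-dddd", the only format the question guarantees.
PHONE_RE = re.compile(r'\(\d{3}\) \d{3}-\d{4}')

def split_address_phone(text):
    """Return (address, phone); phone is None when no number is found."""
    match = PHONE_RE.search(text)
    if match is None:
        return text.strip(), None
    # Everything before the phone number is treated as the address.
    return text[:match.start()].strip(), match.group()

samples = [
    '\n113 W 5th St\nEureka, MO, United States\n(636) 938-9310\n',
    'Clayton Village Shopping Center, 14856 Clayton Rd\nChesterfield, MO, United States\n(636) 227-2844',
    'Wildwood, MO\nUnited States\n(636) 458-7707',
]
for sample in samples:
    print(split_address_phone(sample))
```

Searching for the full pattern rather than splitting on "(" keeps an address that happens to contain a parenthesis from being cut in the wrong place.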

Answered by Joe T. Boka, Sep 25 '22 18:09