
How can I extract address from raw text using NLTK in python?

I have this text:

'''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New York, NY 12345. Can you contact him now? If you need any help, call me on 12345678'''

How can the address part be extracted from the above text using NLTK? I have tried the Stanford NER Tagger, which gives me only New York as a location. How can I solve this?

ngrj asked Jun 10 '16 10:06



2 Answers

Definitely regular expressions :)

Something like

import re

txt = ...
regexp = r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)

# address = ['44 West 22nd Street, New York, NY 12345']

Explanation:

[0-9]{1,3}: 1 to 3 digits, the house number

(space): a space between the number and the street name

.+: the street name, one or more of any character (except a newline)

, (comma and space): the separator before the city

.+: the city, one or more of any character (except a newline)

, (comma and space): the separator before the state

[A-Z]{2}: the state, exactly 2 uppercase letters from A to Z

(space): a space between the state and the ZIP code

[0-9]{5}: the ZIP code, exactly 5 digits

re.findall(pattern, string) returns a list of all non-overlapping matches found in the string.
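As a hedged variation on the pattern above, here is a minimal sketch that runs against the text from the question and splits the match into parts. It swaps the greedy .+ for [^,]+ so that a street or city cannot run across a comma, and it uses named groups. The group names, the extract_addresses helper, and the assumption that every address looks like "number street, city, ST 12345" are mine, not part of the original answer.

```python
import re

TEXT = (
    "Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, "
    "New York, NY 12345. Can you contact him now? "
    "If you need any help, call me on 12345678"
)

# Assumed format: "<number> <street>, <city>, <2-letter state> <5-digit ZIP>".
# [^,]+ (instead of .+) stops each field at the next comma.
ADDRESS_RE = re.compile(
    r"\b(?P<number>\d{1,5}) "
    r"(?P<street>[^,]+), "
    r"(?P<city>[^,]+), "
    r"(?P<state>[A-Z]{2}) "
    r"(?P<zip>\d{5})\b"
)


def extract_addresses(text):
    """Return one dict per address-like match (hypothetical helper)."""
    results = []
    for match in ADDRESS_RE.finditer(text):
        parts = match.groupdict()
        parts["full"] = match.group(0)  # the whole matched address
        results.append(parts)
    return results


if __name__ == "__main__":
    for address in extract_addresses(TEXT):
        print(address["full"])
```

The trailing \b keeps the 8-digit phone number from being read as a ZIP code. This only covers one US-style layout; addresses with apartment numbers, missing commas, or ZIP+4 codes would need a broader pattern or a dedicated parser.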

Alex answered Sep 28 '22 18:09


Pyap works best not just for this particular example but also for other addresses contained in texts.

import pyap

text = ...
addresses = pyap.parse(text, country='US')
Bhio answered Sep 28 '22 20:09