I have this text:

'''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New York, NY 12345. Can you contact him now? If you need any help, call me on 12345678'''

How can the address part be extracted from the above text using NLTK? I have tried the Stanford NER Tagger, which gives me only New York as a Location. How can I solve this?

How can I extract address from raw text using NLTK in python?

2 Answers

Definitely regular expressions :)

Something like

import re

txt = ...
# Raw string, so any backslashes added to the pattern later reach the regex engine unchanged
regexp = r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)

# address = ['44 West 22nd Street, New York, NY 12345']

Explanation:

[0-9]{1,3}: 1 to 3 digits, the house number

(space): a single space between the number and the street name

.+: the street name; one or more of any character except a newline (greedy)

, (comma and space): the separator before the city

.+: the city; again one or more of any character

, (comma and space): the separator before the state

[A-Z]{2}: exactly 2 uppercase letters from A to Z, the state abbreviation, followed by a space

[0-9]{5}: exactly 5 digits, the ZIP code

re.findall(pattern, string) returns a list of all non-overlapping matches found.
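As a rough sketch of how this behaves on the text from the question, the snippet below first runs the pattern as written and then tries a slightly stricter variant. The names `ORIGINAL_PATTERN`, `ADDRESS_RE` and `extract_addresses`, the named groups, and the use of `[^,]+` instead of `.+` are illustrative assumptions, not part of the answer above. Real addresses vary far more than either pattern allows, so treat this as a starting point only.

```python
import re

# Sample text taken from the question.
text = (
    "Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, "
    "New York, NY 12345. Can you contact him now? "
    "If you need any help, call me on 12345678"
)

# The pattern from the answer, unchanged apart from the raw-string prefix.
ORIGINAL_PATTERN = r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
print(re.findall(ORIGINAL_PATTERN, text))
# ['44 West 22nd Street, New York, NY 12345']

# Hypothetical stricter variant: [^,]+ stops each field at the next comma,
# so a greedy .+ cannot swallow neighbouring text, and the named groups
# return the address already split into parts.
ADDRESS_RE = re.compile(
    r"""
    \b(?P<number>[0-9]{1,3})\s   # house number, 1 to 3 digits
    (?P<street>[^,]+),\s         # street name, up to the first comma
    (?P<city>[^,]+),\s           # city, up to the next comma
    (?P<state>[A-Z]{2})\s        # two-letter state abbreviation
    (?P<zip>[0-9]{5})\b          # five-digit ZIP code
    """,
    re.VERBOSE,
)


def extract_addresses(raw_text):
    """Return one dict of address parts per match found in raw_text."""
    return [match.groupdict() for match in ADDRESS_RE.finditer(raw_text)]


for parts in extract_addresses(text):
    print(parts)
```

The phone number at the end of the text is not picked up by either pattern, because no comma, state abbreviation and ZIP code follow it.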

answered Sep 28 '22 18:09

Alex

The pyap library works well not just for this particular example but also for other addresses embedded in text.

import pyap

text = ...
# Returns a list of the addresses detected in the text
addresses = pyap.parse(text, country='US')

answered Sep 28 '22 20:09

Bhio
