Using natural language processing to extract an address from a tweet

I'm building a Twitter bot that will listen for tweets like the following:

Hey @twitterbot, I'm looking for restaurants around 123 Main Street, New York

or, another example:

@twitterbot, what's near Yonge & Dundas, Toronto? I'm hungry!

It'll then reply with the kind of data you'd expect these questions to return. I've got most of the problem solved, but I'm stuck on something that shouldn't be so hard: extracting the address from the tweet.

I'll be forwarding the address to a geocoding service to get lat/lng, so I don't need to format or prepare the address in any way; I just need to isolate it from unrelated text like "I'm looking for restaurants around" or "I'm hungry!".

Are there any NLP tools that will perform this address-identification within a block of text? Any suggestions for another way to go about it? Because Google's geocoder handles such a wide array of address formats (even a point of interest like 'The eaton centre, Toronto' counts as an address), I can't use regex to pluck the address out.

Phrased another way, I just want to remove any text that is not part of an address.

The solution needs to work for US and Canadian addresses.

There are some similar questions on Stack Overflow, but none that I could find tackle this exact problem. Because Google's geocoder is so forgiving, the solution doesn't have to be perfect; it just needs to get rid of enough of the fuzz that Google knows what I'm trying to say.

I'm very new to NLP so I'd appreciate any guidance on the subject.

Joshua Comeau asked Jul 11 '15 17:07



2 Answers

The existing question "How to parse freeform street/postal address out of text, and into components" answers the question "Is there a way to isolate an address from the text around it and break it into pieces?", which is essentially the same question as yours (except that you don't care about breaking it into pieces, just isolating it from the rest of the text).

SmartyStreets also has a nice demo at https://smartystreets.com/demo?mode=extract, but unfortunately it is not a free solution.

Another quick thought: since Twitter posts are limited to 140 characters and tend to contain few words (your two examples have 12 and 9 words, respectively), you could conceivably just brute-force it. For example, to get the location in "@twitterbot, what's near Yonge & Dundas, Toronto? I'm hungry!", you could send all of the following to the Google geocoder:

what's near Yonge & Dundas, Toronto? I'm hungry!

what's near Yonge & Dundas, Toronto? I'm

what's near Yonge & Dundas, Toronto?

what's near Yonge & Dundas,

and so on, for all possible substrings composed of complete words. Whichever candidate the geocoder resolves gives you the address.
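The brute-force idea above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `strip_mentions`, `word_substrings`, and `first_geocodable` are hypothetical helper names, and `geocode` stands in for whatever geocoder client you use (assumed here to return `None` when it cannot resolve a string). Trying the longest candidates first is a design choice of this sketch, not something the geocoder requires.

```python
import re
from typing import Any, Callable, Iterator, Optional, Tuple


def strip_mentions(tweet: str) -> str:
    """Remove @mentions (and a comma or colon right after them)."""
    return re.sub(r"@\w+[,:]?\s*", "", tweet).strip()


def word_substrings(text: str) -> Iterator[str]:
    """Yield every contiguous run of whole words, longest runs first."""
    words = text.split()
    n = len(words)
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            yield " ".join(words[start:start + length])


def first_geocodable(
    tweet: str,
    geocode: Callable[[str], Optional[Any]],
) -> Optional[Tuple[str, Any]]:
    """Return the first (candidate, result) pair the geocoder accepts.

    `geocode` is an assumed callback: it takes a string and returns a
    result object, or None if the string could not be resolved.
    """
    for candidate in word_substrings(strip_mentions(tweet)):
        result = geocode(candidate)
        if result is not None:
            return candidate, result
    return None
```

A tweet of n words produces n(n+1)/2 candidates, so the 8 words left after removing the mention from the second example give 36 geocoder calls in the worst case. Because the geocoder is forgiving, a long candidate with leftover fuzz may resolve before the clean address does, so you may want to compare several hits rather than stop at the first one, and keep an eye on your geocoder's rate limits.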

Gabriel answered Nov 12 '22 14:11



Here you go: http://geocoder.ca/?locate=Hey+%40twitterbot%2C+I%27m+looking+for+restaurants+around+123+Main+Street%2C+New+York&geoit=xml&parse=1

<geodata>
  <latt>40.5119365</latt>
  <longt>-74.2493562</longt>
  <AreaCode>347,718</AreaCode>
  <TimeZone>America/New_York</TimeZone>
  <standard>
    <stnumber>123</stnumber>
    <staddress>Main ST</staddress>
    <city>STATEN ISLAND</city>
    <prov>NY</prov>
    <postal>11385</postal>
    <confidence>0.9</confidence>
  </standard>
</geodata>

or http://geocoder.ca/?locate=Hey+%40twitterbot%2C+I%27m+looking+for+restaurants+around+123+Main+Street%2C+New+York
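If you want to call this from code, a minimal Python sketch might look like the following. It builds the same URL as the first link above (the `locate`, `geoit=xml`, and `parse=1` parameters are taken from that link) and reads the fields out of a response shaped like the sample. The function names are hypothetical, and the sketch assumes every response has the same shape as the sample; error responses are not handled.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict
from urllib.parse import urlencode

BASE_URL = "http://geocoder.ca/"


def build_lookup_url(text: str) -> str:
    """Build the lookup URL using the parameters seen in the link above."""
    params = {"locate": text, "geoit": "xml", "parse": "1"}
    return BASE_URL + "?" + urlencode(params)


def parse_geodata(xml_text: str) -> Dict[str, Any]:
    """Extract coordinates and the parsed address from a <geodata> reply."""
    root = ET.fromstring(xml_text)
    lat = root.findtext("latt")
    lng = root.findtext("longt")
    return {
        "lat": float(lat) if lat else None,
        "lng": float(lng) if lng else None,
        "number": root.findtext("standard/stnumber"),
        "street": root.findtext("standard/staddress"),
        "city": root.findtext("standard/city"),
        "prov": root.findtext("standard/prov"),
        "postal": root.findtext("standard/postal"),
    }


# The sample response from this answer, used here instead of a live request.
SAMPLE = """<geodata>
<latt>40.5119365</latt>
<longt>-74.2493562</longt>
<AreaCode>347,718</AreaCode>
<TimeZone>America/New_York</TimeZone>
<standard>
  <stnumber>123</stnumber>
  <staddress>Main ST</staddress>
  <city>STATEN ISLAND</city>
  <prov>NY</prov>
  <postal>11385</postal>
  <confidence>0.9</confidence>
</standard>
</geodata>"""
```

To use it against the live service, you would fetch `build_lookup_url(tweet)` with something like `urllib.request.urlopen` and pass the decoded body to `parse_geodata`.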

Ervin Ruci answered Nov 12 '22 12:11
