I am working on a project that I am not exactly sure how to approach. The problem can be summarized as follows:
Geographic locations range from states to counties (all within the US), so their number is limited, but each report generally contains references to multiple locations. For example, the first 5 paragraphs of a report might be about a state as a whole, and the next 5 would be about individual counties within that state, or something like that.
I am curious what the best way of approaching a problem like this would be, perhaps with a specific recommendation in terms of NLP or ML frameworks (Python or Java)?
I may actually be able to help a little here (my research is in the area of Toponym Resolution).
If I understand you correctly, you are looking for a way to (1) find the place names in the text, (2) disambiguate the place name's geographic reference, and (3) spatially ground whole sentences or paragraphs.
There are a lot of open source packages that can do #1, for example Stanford CoreNLP and OpenNLP.
There are a few packages that can do #1 and #2. CLAVIN is probably the only ready-to-use open source application that can do this at the moment. Yahoo Placemaker costs money but can do it too.
There isn't really a package that can do #3. There is a newer project called TEXTGROUNDER doing something called "Document Geolocation", but while the code is available, it is not set up to be run on your own input texts. I only recommend you look at it if you are itching to either start or contribute to a project trying to do something like this.
All three tasks are still part of ongoing research and can get incredibly complicated depending on the details of the source text. You didn't provide much detail about your texts, but hopefully this information can help you.
Old question, but it may be useful for others to know that Apache OpenNLP has an addon called the GeoEntityLinker. It takes document text and sentences, extracts entities (toponyms), performs lookups against the USGS and GeoNames gazetteers (Lucene indexes), resolves (or at least attempts to resolve) the toponyms in several ways, and returns the scored gazetteer entries in relation to each sentence in the document passed in. It will be released with OpenNLP 1.6 if all goes well. There is not much documentation, if any, at this point.
This is the ticket in OpenNLP Jira: https://issues.apache.org/jira/i#browse/OPENNLP-579.
This is the source code:
http://svn.apache.org/viewvc/opennlp/addons/geoentitylinker-addon/
FYI: I am the main committer working on it.
Identifying mentions of geographic locations is rather trivial using OpenNLP, GATE, etc. The main problem comes afterwards, when you have to disambiguate places with the same name. For example, in the US there are 29 places named "Bristol". Which one is the correct one?
There are several approaches you can use to disambiguate. A simple one is to gather the list of all locations mentioned in the text, get their candidate latitude/longitude pairs, and then pick one candidate per location so that the sum of distances between the chosen points is minimal.
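The minimum-sum-of-distances idea can be sketched in a few lines of Python. This is only an illustration under some assumptions: the candidate list is hand-written with approximate coordinates (a real system would pull candidates from a gazetteer), the function names are made up for this example, and the search is exhaustive, so it only works for a handful of ambiguous names.

```python
from itertools import combinations, product
from math import asin, cos, radians, sin, sqrt


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def resolve_by_min_distance(candidates):
    """Pick one candidate per place name so that the sum of pairwise
    distances between the chosen points is as small as possible.

    candidates maps a place name to a list of (label, lat, lon) tuples.
    Every combination is tried, so the cost grows exponentially with the
    number of ambiguous names.
    """
    names = list(candidates)
    best_choice, best_cost = None, float("inf")
    for choice in product(*(candidates[name] for name in names)):
        cost = sum(haversine_km(p[1:], q[1:]) for p, q in combinations(choice, 2))
        if cost < best_cost:
            best_choice, best_cost = choice, cost
    return dict(zip(names, best_choice))


# Hypothetical input: approximate coordinates typed in by hand for illustration.
candidates = {
    "Bristol": [
        ("Bristol, TN", 36.60, -82.19),
        ("Bristol, CT", 41.67, -72.95),
    ],
    "Kingsport": [("Kingsport, TN", 36.55, -82.56)],
    "Johnson City": [("Johnson City, TN", 36.31, -82.35)],
}

resolved = resolve_by_min_distance(candidates)
for name, (label, lat, lon) in resolved.items():
    print(name, "->", label)
```

Because the other two places mentioned are in Tennessee, the Tennessee candidate for "Bristol" gives the smallest total distance. For longer texts a greedy or clustering-based search would be needed in place of the exhaustive loop.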
A better solution that I have seen people deploy is to get all Wikipedia articles that refer to places, put them in a full-text index like Lucene, and then use your text as a query to find the most promising location among the candidates by measuring some similarity score. The idea is that the article will mention not only the word "Bristol" but also a river name, a person, or something similar that appears in your text too.
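To make the similarity idea concrete, here is a minimal sketch that swaps Lucene for a plain bag-of-words cosine similarity. The article snippets are short made-up stand-ins for Wikipedia text, and `bag_of_words`, `cosine`, and `best_candidate` are names invented for this example; a real setup would index the full articles and rely on the search engine's own scoring.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_candidate(query_text, articles):
    """Return the key of the article most similar to the query text."""
    query = bag_of_words(query_text)
    return max(articles, key=lambda name: cosine(query, bag_of_words(articles[name])))


# Made-up snippets standing in for the text of two candidate articles.
articles = {
    "Bristol, Tennessee": (
        "Bristol is a city in Sullivan County Tennessee known for its "
        "motor speedway and country music heritage"
    ),
    "Bristol, Connecticut": (
        "Bristol is a city in Hartford County Connecticut known for its "
        "clock museum and a sports broadcaster headquarters"
    ),
}

report = "The report covers flooding in Bristol and nearby parts of Sullivan County, Tennessee."
best = best_candidate(report, articles)
print(best)
```

Raw term counts give common words too much weight, so on real text some form of term weighting and stop-word removal would be needed, which a full-text engine provides.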