I have lots of strings like the following:
ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab
KARACHI, July 24 -- Police claimed to have arrested several suspects in separate
ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin
I am using NLTK and want to remove the dateline part and recognize the date, location and person name.
Using POS tagging I can find the parts of speech, but I need to determine the location, date and person name. How can I do that?
Update:
Note: I don't want to perform another HTTP request. I need to parse it using my own code. If there is a library, it's okay to use it.
Update:
I tried ne_chunk, but no luck.
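import nltk

def pchunk(t):
    w_tokens = nltk.word_tokenize(t)  # split the text into word tokens
    pt = nltk.pos_tag(w_tokens)       # tag each token with its part of speech
    ne = nltk.ne_chunk(pt)            # group tagged tokens into named-entity chunks
    print(ne)

# txts is a list of those 3 sentences.
for t in txts:
    print(t)
    pchunk(t)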
The output is as follows:
ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab
(S
ISLAMABAD/NNP
:/:
Chief/NNP
Justice/NNP
(PERSON Iftikhar/NNP Muhammad/NNP Chaudhry/NNP)
said/VBD
that/IN
(ORGANIZATION National/NNP Accountab/NNP))
KARACHI, July 24 -- Police claimed to have arrested several suspects in separate
(S
(GPE KARACHI/NNP)
,/,
July/NNP
24/CD
--/:
Police/NNP
claimed/VBD
to/TO
have/VB
arrested/VBN
several/JJ
suspects/NNS
in/IN
separate/JJ)
ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin
(S
(GPE ALUM/NN)
(ORGANIZATION KULAM/NN)
,/,
(PERSON Sri/NNP Lanka/NNP)
--/:
As/IN
gray-bellied/JJ
clouds/NNS
started/VBN
to/TO
blot/VB
out/RP
the/DT
scorchin/NN)
Look carefully: KARACHI is recognized correctly as a GPE, but Sri Lanka is recognized as a PERSON, and ISLAMABAD is only tagged NNP rather than chunked as a GPE.
If using an API instead of your own code is OK for your requirements, this is something the Wit API can easily do for you.
Wit will also resolve date/time tokens into normalized dates.
To get started you just have to provide a few examples.
Yahoo has a PlaceFinder API that should help with identifying places. It looks like the places are always at the start, so it could be worth taking the first couple of words and throwing them at the API until it hits a limit:
http://developer.yahoo.com/boss/geo/
It may also be worth using the dreaded regex to identify capitals; see the question "Regular expression for checking if capital letters are found consecutively in a string?"
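As a rough illustration of the regex idea, here is a minimal sketch that needs no HTTP request. It assumes the dateline always starts with an all-caps place name, optionally followed by a comma and either a region or a "Month Day" date, and ends with ":" or "--". The names DATELINE, MONTH_DAY and split_dateline are made up for this example, and the pattern is only tuned to the three sample strings in the question, so real data will likely need adjustments.

```python
import re

# Hypothetical pattern (an assumption, not from any library):
# all-caps place, optional ", region-or-date", then ":" or "--", then the body.
DATELINE = re.compile(
    r"""^(?P<city>[A-Z][A-Z.']+(?:\s+[A-Z][A-Z.']+)*)  # e.g. ALUM KULAM
        (?:,\s*(?P<rest>[^:]*?))?                      # e.g. July 24 / Sri Lanka
        \s*(?::|--)\s*                                 # dateline separator
        (?P<body>.*)$""",
    re.VERBOSE | re.DOTALL,
)

# Very small "Month Day" check; full or abbreviated English month names only.
MONTH_DAY = re.compile(
    r"^(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?\s+\d{1,2}$"
)


def split_dateline(text):
    """Return a dict with city, region, date and body, or None if no dateline."""
    m = DATELINE.match(text)
    if m is None:
        return None
    rest = (m.group("rest") or "").strip()
    date = region = None
    if rest:
        # Whatever follows the comma is treated as a date if it looks like
        # "Month Day", otherwise as a region or country name.
        if MONTH_DAY.match(rest):
            date = rest
        else:
            region = rest
    return {
        "city": m.group("city"),
        "region": region,
        "date": date,
        "body": m.group("body"),
    }


if __name__ == "__main__":
    samples = [
        "ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that",
        "KARACHI, July 24 -- Police claimed to have arrested several suspects",
        "ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out",
    ]
    for s in samples:
        print(split_dateline(s))
```

The remaining body can then be passed to ne_chunk on its own, which may help because the tagger no longer sees the all-caps place name and the dateline punctuation.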
Good luck!