Parse Location, Person name, Date from string by NLTK

I have lots of strings like the following:

  1. ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab
  2. KARACHI, July 24 -- Police claimed to have arrested several suspects in separate
  3. ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin

I am using NLTK and want to remove the dateline part and recognize the date, location and person name.

Using POS tagging I can find the parts of speech, but I need to determine the location, date and person name. How can I do that?

Update:

Note: I don't want to perform another HTTP request. I need to parse it using my own code. If there is a library, it's okay to use it.

Update:

I tried ne_chunk, but no luck.

import nltk

def pchunk(t):
    # Tokenize, POS-tag, then run NLTK's default named-entity chunker.
    w_tokens = nltk.word_tokenize(t)
    pt = nltk.pos_tag(w_tokens)
    ne = nltk.ne_chunk(pt)
    print(ne)

# txts is a list of those 3 sentences.
for t in txts:
    print(t)
    pchunk(t)

The output is as follows:

ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab

(S
  ISLAMABAD/NNP
  :/:
  Chief/NNP
  Justice/NNP
  (PERSON Iftikhar/NNP Muhammad/NNP Chaudhry/NNP)
  said/VBD
  that/IN
  (ORGANIZATION National/NNP Accountab/NNP))

KARACHI, July 24 -- Police claimed to have arrested several suspects in separate

(S
  (GPE KARACHI/NNP)
  ,/,
  July/NNP
  24/CD
  --/:
  Police/NNP
  claimed/VBD
  to/TO
  have/VB
  arrested/VBN
  several/JJ
  suspects/NNS
  in/IN
  separate/JJ)

ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin

(S
  (GPE ALUM/NN)
  (ORGANIZATION KULAM/NN)
  ,/,
  (PERSON Sri/NNP Lanka/NNP)
  --/:
  As/IN
  gray-bellied/JJ
  clouds/NNS
  started/VBN
  to/TO
  blot/VB
  out/RP
  the/DT
  scorchin/NN)

Check carefully. KARACHI is recognized correctly as a GPE, but Sri Lanka is recognized as a PERSON, and ISLAMABAD is only tagged as NNP, not chunked as a GPE.

Shiplu Mokaddim asked Feb 04 '14 09:02


2 Answers

If using an API rather than your own code is acceptable for your requirements, this is something the Wit API can do for you.

Wit will also resolve date/time tokens into normalized dates.

To get started you just have to provide a few examples.
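
If an HTTP request is not an option, the date normalization step on its own can be roughly approximated locally. The following is only a minimal sketch using Python's standard library, not what Wit does internally. The function name `normalize_month_day` is hypothetical, and the `year` argument is an assumption you have to supply yourself (for example from the article's metadata), since datelines like "July 24" carry no year. It also assumes English month names under the default locale.

```python
from datetime import datetime

def normalize_month_day(token, year):
    """Turn a dateline token such as 'July 24' into an ISO date string.

    Returns None when the token is not a recognizable month/day pair.
    `year` is supplied by the caller because the dateline has none.
    """
    cleaned = token.strip().rstrip(".")
    # Try the full month name first ("July"), then the abbreviation ("Jul").
    for fmt in ("%B %d", "%b %d"):
        try:
            # Parse together with the year so that Feb 29 is validated correctly.
            parsed = datetime.strptime(f"{cleaned} {year}", f"{fmt} %Y")
        except ValueError:
            continue
        return parsed.date().isoformat()
    return None

print(normalize_month_day("July 24", 2014))
print(normalize_month_day("Sri Lanka", 2014))
```

A token that fails to parse (such as "Sri Lanka") comes back as `None`, which is a cheap way to tell a date apart from a country name in the dateline.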

Blacksad answered Oct 21 '22 08:10


Yahoo has a PlaceFinder API that should help with identifying places. It looks like the places are always at the start, so it could be worth taking the first couple of words and sending them to the API until it hits a limit:

http://developer.yahoo.com/boss/geo/

It may also be worth using the dreaded regex to identify the capitalized words at the start; see the question "Regular expression for checking if capital letters are found consecutively in a string?"
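
As a rough illustration of the regex idea, here is a minimal sketch that splits the dateline from the body. It assumes, based only on the three sample strings in the question, that the dateline starts with a run of ALL-CAPS words, is optionally followed by a comma and a date or region, and ends at the first ":" or "--". The names `DATELINE_RE`, `MONTH_DAY_RE` and `split_dateline` are made up for this example, and real datelines will certainly have variants this pattern misses.

```python
import re

# Leading ALL-CAPS words, optional ", <date or region>", then ":" or "--".
DATELINE_RE = re.compile(
    r"^(?P<city>[A-Z]+(?:\s+[A-Z]+)*)"   # e.g. "KARACHI" or "ALUM KULAM"
    r"(?:,\s*(?P<rest>[^:]*?))?"         # e.g. "July 24" or "Sri Lanka"
    r"\s*(?::|--)\s*"                    # separator between dateline and body
    r"(?P<body>.*)$",
    re.DOTALL,
)

# Crude check for "<month name> <day>", used to tell a date from a region.
MONTH_DAY_RE = re.compile(
    r"^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?\s+\d{1,2}$"
)

def split_dateline(text):
    """Return a dict with the city, region, date and body found in `text`."""
    m = DATELINE_RE.match(text)
    if m is None:
        # No dateline recognized: treat the whole string as body.
        return {"city": None, "region": None, "date": None, "body": text}
    rest = m.group("rest")
    is_date = bool(rest) and MONTH_DAY_RE.match(rest) is not None
    return {
        "city": m.group("city"),
        "region": None if is_date else (rest or None),
        "date": rest if is_date else None,
        "body": m.group("body"),
    }

for sample in (
    "ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab",
    "KARACHI, July 24 -- Police claimed to have arrested several suspects in separate",
    "ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin",
):
    print(split_dateline(sample))
```

Once the dateline is stripped, the remaining body could still be passed to ne_chunk for person names, so the chunker no longer has to deal with the all-caps place names.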

Good luck!

Malcolm Murdoch answered Oct 21 '22 08:10