Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parse text to get the proper nouns (names and organizations) - python nltk

Tags:

python

nltk

I am trying to extract proper nouns as in Names and Organization names from very small chunks of texts like sms, the basic parsers available with nltk Finding Proper Nouns using NLTK WordNet are being able to get the nouns but the problem is when we get proper nouns not starting with a capital letter , for texts like this the names like sumit do not get recognized as proper nouns

>>> sentence = "i spoke with sumit and rajesh and Samit about the gridlock situation last night @ around 8 pm last nite"
>>> tagged_sent = pos_tag(sentence.split())
>>> print tagged_sent
[('i', 'PRP'), ('spoke', 'VBP'), ('with', 'IN'), **('sumit', 'NN')**, ('and', 'CC'), ('rajesh', 'JJ'), ('and', 'CC'), **('Samit', 'NNP'),** ('about', 'IN'), ('the', 'DT'), ('gridlock', 'NN'), ('situation', 'NN'), ('last', 'JJ'), ('night', 'NN'), ('@', 'IN'), ('around', 'IN'), ('8', 'CD'), ('pm', 'NN'), ('last', 'JJ'), ('nite', 'NN')]
like image 835
Brij Raj Singh - MSFT Avatar asked Oct 21 '13 12:10

Brij Raj Singh - MSFT


2 Answers

There is a better way to extract names of people and organizations

from nltk import pos_tag, ne_chunk
from nltk.tokenize import SpaceTokenizer

tokenizer = SpaceTokenizer()
toks = tokenizer.tokenize(sentence)
pos = pos_tag(toks)
chunked_nes = ne_chunk(pos) 

nes = [' '.join(map(lambda x: x[0], ne.leaves())) for ne in chunked_nes if isinstance(ne, nltk.tree.Tree)]

However all Named Entity Recognizers commit errors. If you really don't want to miss any proper name, you could use a dict of Proper Names and check if the name is contained in the dict.

like image 151
user278064 Avatar answered Sep 22 '22 21:09

user278064


You might want to have a look at python-nameparser. It tries to guess capitalization of names also. Sorry for the incomplete answer but I don't have much experience using python-nameparser.

Best of luck!

like image 28
Saheel Godhane Avatar answered Sep 20 '22 21:09

Saheel Godhane