I am trying to use the StanfordNERTagger and nltk to extract keywords from a piece of text.
import re
from nltk.corpus import stopwords
from nltk.tag import StanfordNERTagger, StanfordPOSTagger

docText = "John Donk works for POI. Brian Jones wants to meet with Xyz Corp. for measuring POI's Short Term performance Metrics."
words = re.split(r"\W+", docText)
stops = set(stopwords.words("english"))
#remove stop words from the list
words = [w for w in words if w not in stops and len(w) > 2]
text = " ".join(words)  # renamed from 'str' to avoid shadowing the built-in
print text
stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')
stanfordPosTagList = [word for word, pos in stp.tag(text.split()) if pos == 'NNP']
print "Stanford POS Tagged"
print stanfordPosTagList
tagged = stn.tag(stanfordPosTagList)
print tagged
This gives me:
John Donk works POI Brian Jones wants meet Xyz Corp measuring POI Short Term performance Metrics
Stanford POS Tagged
[u'John', u'Donk', u'POI', u'Brian', u'Jones', u'Xyz', u'Corp', u'POI', u'Short', u'Term']
[(u'John', u'PERSON'), (u'Donk', u'PERSON'), (u'POI', u'ORGANIZATION'), (u'Brian', u'ORGANIZATION'), (u'Jones', u'ORGANIZATION'), (u'Xyz', u'ORGANIZATION'), (u'Corp', u'ORGANIZATION'), (u'POI', u'O'), (u'Short', u'O'), (u'Term', u'O')]
So clearly, things like Short and Term were tagged as NNP. The data that I have contains many such instances where non-NNP words are capitalized. This might be due to typos, or maybe they are headers; I don't have much control over that.
How can I parse or clean up the data so that I can detect a non-NNP term even though it may be capitalized? I don't want terms like Short and Term to be categorized as NNP.
Also, I am not sure why John Donk was captured as a person but Brian Jones was not. Could it be due to the other capitalized non-NNP words in my data? Could that be having an effect on how the StanfordNERTagger treats everything else?
Update: one possible solution
Here is what I plan to do: lowercase each word and tag the lowered copy; if the lowered word is still tagged as NNP, then we know that the original word must also be an NNP.
Here is what I tried to do:
text = " ".join(words)
print text
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')
for word in text.split():
    wl = word.lower()
    print wl
    w, pos = stp.tag(wl)
    print pos
    if pos == "NNP":
        print "Got NNP"
        print w
but this gives me an error:
John Donk works POI Jones wants meet Xyz Corp measuring POI short term performance metrics
john
Traceback (most recent call last):
File "X:\crp.py", line 37, in <module>
w,pos = stp.tag(wl)
ValueError: too many values to unpack
I have tried multiple approaches, but some error always shows up. How can I tag a single word? I don't want to convert the whole string to lower case and then tag it; if I do that, the StanfordPOSTagger returns an empty string.
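For what it's worth, a minimal sketch of the single-word call (my reading of the NLTK API: tag() expects a list of tokens, so a bare string is iterated character by character, which is what triggers the unpacking error above):

wl = word.lower()
# wrap the single word in a list; tag() returns a list of (token, tag) pairs
w, pos = stp.tag([wl])[0]
print w, pos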
Alternatively, you can use NLTK's ne_chunk, but it doesn't seem to do much unless you are concerned about what kind of proper noun you get from the sentence: using ne_chunk is a little verbose, and it doesn't get you the possessives. I think what you need is a tagger, a part-of-speech tagger.
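To illustrate (a small sketch of mine; ne_chunk runs NLTK's built-in named-entity chunker over POS-tagged tokens):

from nltk import word_tokenize, pos_tag, ne_chunk

sent = "John Donk works for POI. Brian Jones wants to meet with Xyz Corp."
# ne_chunk wraps recognized entities in subtrees labeled PERSON, ORGANIZATION, etc.
tree = ne_chunk(pos_tag(word_tokenize(sent)))
print tree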
Stanford Parser works seamlessly with the updated NLTK package, so you should first update your existing NLTK to the latest release to avoid errors, and then download the required Stanford Parser packages.
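Assuming pip manages your NLTK install, the usual upgrade command would be:

pip install -U nltk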
Proper nouns identify specific people, places, and things. Extracting entities such as proper nouns makes it easier to mine data. For example, we can perform named entity extraction, where an algorithm takes a string of text (a sentence or paragraph) as input and identifies the relevant nouns (people, places, and organizations) present in it.
In order to run the Python code below, you must have NLTK and its associated packages installed. You can refer to the link for installation: How to install NLTK.
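As a small illustration of such extraction (a sketch of mine, using NLTK's default tagger rather than the Stanford one):

from nltk import word_tokenize, pos_tag

text = "Brian Jones wants to meet with Xyz Corp."
# keep only the tokens tagged as singular proper nouns
proper_nouns = [w for w, pos in pos_tag(word_tokenize(text)) if pos == 'NNP']
print proper_nouns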
Firstly, see your other question on setting up Stanford CoreNLP so it can be called from the command line or Python: nltk : How to prevent stemming of proper nouns.
For the properly cased sentence, we see that the NER works as expected:
>>> from corenlp import StanfordCoreNLP
>>> nlp = StanfordCoreNLP('http://localhost:9000')
>>> text = ('John Donk works POI Jones wants meet Xyz Corp measuring POI short term performance metrics. '
... 'john donk works poi jones wants meet xyz corp measuring poi short term performance metrics')
>>> output = nlp.annotate(text, properties={'annotators': 'tokenize,ssplit,pos,ner', 'outputFormat': 'json'})
>>> annotated_sent0 = output['sentences'][0]
>>> annotated_sent1 = output['sentences'][1]
>>> for token in annotated_sent0['tokens']:
... print token['word'], token['lemma'], token['pos'], token['ner']
...
John John NNP PERSON
Donk Donk NNP PERSON
works work VBZ O
POI POI NNP ORGANIZATION
Jones Jones NNP ORGANIZATION
wants want VBZ O
meet meet VB O
Xyz Xyz NNP ORGANIZATION
Corp Corp NNP ORGANIZATION
measuring measure VBG O
POI poi NN O
short short JJ O
term term NN O
performance performance NN O
metrics metric NNS O
. . . O
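From the same JSON output, pulling out just the named-entity tokens is then straightforward (continuing the session above):

>>> [(token['word'], token['ner']) for token in annotated_sent0['tokens'] if token['ner'] != 'O']
[(u'John', u'PERSON'), (u'Donk', u'PERSON'), (u'POI', u'ORGANIZATION'), (u'Jones', u'ORGANIZATION'), (u'Xyz', u'ORGANIZATION'), (u'Corp', u'ORGANIZATION')]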
And for the lower-cased sentence, you will not get NNP POS tags nor any NER tags:
>>> for token in annotated_sent1['tokens']:
... print token['word'], token['lemma'], token['pos'], token['ner']
...
john john NN O
donk donk JJ O
works work NNS O
poi poi VBP O
jones jone NNS O
wants want VBZ O
meet meet VB O
xyz xyz NN O
corp corp NN O
measuring measure VBG O
poi poi NN O
short short JJ O
term term NN O
performance performance NN O
metrics metric NNS O
So the question behind your question should be: how did the input end up cased this way? And after answering that, you can decide what you really want to do with the NER tags, i.e.:
- If the input is lower-cased and it's because of how you structured your NLP tool chain, then
- If the input is lower-cased because that's how the original data was, then
- If the input has erroneous casing, e.g. some words capitalized and some small but not all of them proper nouns, then
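Whichever case applies, once the tags are reliable, here is a small sketch (my addition, not part of the original answer) that collapses consecutive tokens sharing a non-O NER tag from the CoreNLP JSON above into whole entities:

def extract_entities(tokens):
    # group consecutive tokens that carry the same non-O NER tag
    entities, current, current_tag = [], [], None
    for tok in tokens:
        if tok['ner'] != 'O' and tok['ner'] == current_tag:
            current.append(tok['word'])
        else:
            if current:
                entities.append((' '.join(current), current_tag))
            if tok['ner'] != 'O':
                current, current_tag = [tok['word']], tok['ner']
            else:
                current, current_tag = [], None
    if current:
        entities.append((' '.join(current), current_tag))
    return entities

print extract_entities(annotated_sent0['tokens'])
# [(u'John Donk', u'PERSON'), (u'POI Jones', u'ORGANIZATION'), (u'Xyz Corp', u'ORGANIZATION')]

Note that this naive grouping merges adjacent entities of the same type (POI and Jones end up as one span); telling them apart would need the character offsets or mention-level annotations that CoreNLP can also provide.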