Linguistic tagger incorrectly tagging as 'OtherWord'

I've been using NSLinguisticTagger with sentences and have been encountering a strange issue with sentences such as 'I am hungry' or 'I am drunk'. Whilst one would expect 'I' to be tagged as a pronoun, 'am' as a verb and 'hungry' as an adjective, they are not. Rather they are all tagged as OtherWord.

Is there something I'm doing incorrectly?

NSString *input = @"I am hungry";
NSLinguisticTaggerOptions options = NSLinguisticTaggerOmitWhitespace;
NSLinguisticTagger *tagger = [[NSLinguisticTagger alloc] initWithTagSchemes:[NSLinguisticTagger availableTagSchemesForLanguage:@"en"] options:options];
tagger.string = input;

[tagger enumerateTagsInRange:NSMakeRange(0, input.length) scheme:NSLinguisticTagSchemeNameTypeOrLexicalClass options:options usingBlock:^(NSString *tag, NSRange tokenRange, NSRange sentenceRange, BOOL *stop) {
    NSString *token = [input substringWithRange:tokenRange];
    NSString *lemma = [tagger tagAtIndex:tokenRange.location
                                  scheme:NSLinguisticTagSchemeLemma
                              tokenRange:NULL
                           sentenceRange:NULL];
    NSLog(@"%@ (%@) : %@\n", token, lemma, tag);
}];

And the output is:

I ((null)) : OtherWord
am ((null)) : OtherWord
hungry ((null)) : OtherWord
Joshua asked Mar 27 '15 22:03

1 Answer

After quite some time in chat we found the issue:

The sentence does not contain enough information to determine its language.

To fix this you can either:

Add a demo sentence in your language of choice after your actual sentence. That should give the tagger enough text to detect your preferred language.

OR

Tell the tagger what language to use: add the line

[tagger setOrthography:[NSOrthography orthographyWithDominantScript:@"Latn" languageMap:@{@"Latn" : @[@"en"]}] range:NSMakeRange(0, input.length)];

before the enumerate call. That way you explicitly tell the tagger what language you want the text to be in, in this case English (en), written in the Latin script (Latn), which is set as the dominant script.

If you don't know the language for sure, it may be useful to apply either of these methods only as a fallback, when the words get tagged as OtherWord, meaning the language could not be detected.
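The fallback idea could be sketched roughly as follows. This is only an illustration, not tested against every input: MakeTagger and AllTokensAreOtherWord are hypothetical helpers (not part of Foundation), English is assumed as the fallback language, and "every word tagged OtherWord" is used as a heuristic for "language not detected". A fresh tagger is created for the fallback so the result does not depend on whether earlier tags are recomputed.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: builds a tagger for `text`; if `orthography` is non-nil,
// it is applied to the whole string before any tagging happens.
static NSLinguisticTagger *MakeTagger(NSString *text, NSOrthography *orthography) {
    NSLinguisticTagger *tagger = [[NSLinguisticTagger alloc]
        initWithTagSchemes:[NSLinguisticTagger availableTagSchemesForLanguage:@"en"]
                   options:NSLinguisticTaggerOmitWhitespace];
    tagger.string = text;
    if (orthography != nil) {
        [tagger setOrthography:orthography range:NSMakeRange(0, text.length)];
    }
    return tagger;
}

// Hypothetical helper: YES when at least one word was found and every word
// was tagged OtherWord. Punctuation is skipped so it cannot mask the problem.
static BOOL AllTokensAreOtherWord(NSLinguisticTagger *tagger, NSString *text) {
    __block BOOL sawToken = NO;
    __block BOOL allOther = YES;
    NSLinguisticTaggerOptions opts = NSLinguisticTaggerOmitWhitespace | NSLinguisticTaggerOmitPunctuation;
    [tagger enumerateTagsInRange:NSMakeRange(0, text.length)
                          scheme:NSLinguisticTagSchemeNameTypeOrLexicalClass
                         options:opts
                      usingBlock:^(NSString *tag, NSRange tokenRange, NSRange sentenceRange, BOOL *stop) {
        sawToken = YES;
        if (![tag isEqualToString:NSLinguisticTagOtherWord]) {
            allOther = NO;
            *stop = YES;
        }
    }];
    return sawToken && allOther;
}

int main(void) {
    @autoreleasepool {
        NSString *input = @"I am hungry";

        // First attempt: let the tagger detect the language on its own.
        NSLinguisticTagger *tagger = MakeTagger(input, nil);

        if (AllTokensAreOtherWord(tagger, input)) {
            // Fallback (assumption: the text is English in the Latin script).
            NSOrthography *english = [NSOrthography orthographyWithDominantScript:@"Latn"
                                                                     languageMap:@{@"Latn" : @[@"en"]}];
            tagger = MakeTagger(input, english);
        }

        [tagger enumerateTagsInRange:NSMakeRange(0, input.length)
                              scheme:NSLinguisticTagSchemeNameTypeOrLexicalClass
                             options:NSLinguisticTaggerOmitWhitespace
                          usingBlock:^(NSString *tag, NSRange tokenRange, NSRange sentenceRange, BOOL *stop) {
            NSLog(@"%@ : %@", [input substringWithRange:tokenRange], tag);
        }];
    }
    return 0;
}
```

Longer inputs whose language is detected automatically never reach the fallback branch, so the forced orthography only affects short or ambiguous strings.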

luk2302 answered Sep 21 '22 15:09
