I've been using NSLinguisticTagger with sentences and have been encountering a strange issue with sentences such as 'I am hungry' or 'I am drunk'. Whilst one would expect 'I' to be tagged as a pronoun, 'am' as a verb and 'hungry' as an adjective, they are not. Rather, they are all tagged as OtherWord.
Is there something I'm doing incorrectly?
NSString *input = @"I am hungry";
NSLinguisticTaggerOptions options = NSLinguisticTaggerOmitWhitespace;
NSLinguisticTagger *tagger = [[NSLinguisticTagger alloc] initWithTagSchemes:[NSLinguisticTagger availableTagSchemesForLanguage:@"en"] options:options];
tagger.string = input;
[tagger enumerateTagsInRange:NSMakeRange(0, input.length) scheme:NSLinguisticTagSchemeNameTypeOrLexicalClass options:options usingBlock:^(NSString *tag, NSRange tokenRange, NSRange sentenceRange, BOOL *stop) {
    NSString *token = [input substringWithRange:tokenRange];
    NSString *lemma = [tagger tagAtIndex:tokenRange.location
                                  scheme:NSLinguisticTagSchemeLemma
                              tokenRange:NULL
                           sentenceRange:NULL];
    NSLog(@"%@ (%@) : %@\n", token, lemma, tag);
}];
And the output is:
I ((null)) : OtherWord
am ((null)) : OtherWord
hungry ((null)) : OtherWord
After quite some time in chat we found the issue:
The sentence does not contain enough information to determine its language.
To fix this you can either:
Add a demo sentence in your language of choice after your actual sentence. That should give the tagger enough text to detect your preferred language.
OR
Tell the tagger what language to use: add the line
[tagger setOrthography:[NSOrthography orthographyWithDominantScript:@"Latn" languageMap:@{@"Latn" : @[@"en"]}] range:NSMakeRange(0, input.length)];
before the enumerate call. That way you explicitly tell the tagger what language the text is in, in this case English (en) written in the Latin script (Latn).
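For reference, here is a minimal sketch of the question's code with the orthography line in place. The helper name TagEnglishString is made up for this example, and the lemma lookup is left out to keep it short; everything else uses the same NSLinguisticTagger calls as above.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): tags each token of
// `input`, treating the text as English written in the Latin script.
static NSArray<NSString *> *TagEnglishString(NSString *input) {
    NSLinguisticTaggerOptions options = NSLinguisticTaggerOmitWhitespace;
    NSLinguisticTagger *tagger = [[NSLinguisticTagger alloc]
        initWithTagSchemes:[NSLinguisticTagger availableTagSchemesForLanguage:@"en"]
                   options:options];
    tagger.string = input;

    NSRange fullRange = NSMakeRange(0, input.length);

    // State the orthography up front so the tagger does not have to guess
    // the language from a very short sentence.
    NSOrthography *english =
        [NSOrthography orthographyWithDominantScript:@"Latn"
                                         languageMap:@{@"Latn" : @[@"en"]}];
    [tagger setOrthography:english range:fullRange];

    // Collect one "token : tag" line per token instead of logging directly.
    NSMutableArray<NSString *> *lines = [NSMutableArray array];
    [tagger enumerateTagsInRange:fullRange
                          scheme:NSLinguisticTagSchemeNameTypeOrLexicalClass
                         options:options
                      usingBlock:^(NSString *tag, NSRange tokenRange,
                                   NSRange sentenceRange, BOOL *stop) {
        NSString *token = [input substringWithRange:tokenRange];
        [lines addObject:[NSString stringWithFormat:@"%@ : %@", token, tag]];
    }];
    return lines;
}
```

Logging the result of TagEnglishString(@"I am hungry") should then show lexical classes for the three words rather than OtherWord, assuming the English tagging data is available on the device.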
If you don't know the language for sure, it may be useful to apply either of these methods only as a fallback, when every word gets tagged as OtherWord, meaning the language could not be detected.
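The fallback idea could look roughly like this. It is only a sketch: the helper names (TagsForString, AllTagsAreOtherWord, TagsWithEnglishFallback) are invented for this example, and it assumes English is the language you want to fall back to. A fresh tagger is created for the second pass rather than relying on the first one re-tagging after the orthography changes.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: runs one tagging pass and returns the tags in order.
// Pass nil for `orthography` to let the tagger detect the language itself.
static NSArray<NSString *> *TagsForString(NSString *input,
                                          NSOrthography *orthography) {
    NSLinguisticTaggerOptions options = NSLinguisticTaggerOmitWhitespace;
    NSLinguisticTagger *tagger = [[NSLinguisticTagger alloc]
        initWithTagSchemes:[NSLinguisticTagger availableTagSchemesForLanguage:@"en"]
                   options:options];
    tagger.string = input;

    NSRange fullRange = NSMakeRange(0, input.length);
    if (orthography != nil) {
        [tagger setOrthography:orthography range:fullRange];
    }

    NSMutableArray<NSString *> *tags = [NSMutableArray array];
    [tagger enumerateTagsInRange:fullRange
                          scheme:NSLinguisticTagSchemeNameTypeOrLexicalClass
                         options:options
                      usingBlock:^(NSString *tag, NSRange tokenRange,
                                   NSRange sentenceRange, BOOL *stop) {
        [tags addObject:(tag ?: @"")];
    }];
    return tags;
}

// Returns YES only when there is at least one tag and all are OtherWord.
static BOOL AllTagsAreOtherWord(NSArray<NSString *> *tags) {
    for (NSString *tag in tags) {
        if (![tag isEqualToString:NSLinguisticTagOtherWord]) {
            return NO;
        }
    }
    return tags.count > 0;
}

// Tries automatic detection first; forces English only if that fails.
static NSArray<NSString *> *TagsWithEnglishFallback(NSString *input) {
    NSArray<NSString *> *detected = TagsForString(input, nil);
    if (!AllTagsAreOtherWord(detected)) {
        return detected;
    }
    NSOrthography *english =
        [NSOrthography orthographyWithDominantScript:@"Latn"
                                         languageMap:@{@"Latn" : @[@"en"]}];
    return TagsForString(input, english);
}
```

Longer text that the tagger can identify on its own keeps its detected language; only input that comes back entirely as OtherWord is re-tagged as English.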