 

Obtain multiple taggings with Stanford POS Tagger

I'm performing POS tagging with the Stanford POS Tagger. The tagger only returns one possible tagging for the input sentence. For instance, when provided with the input sentence "The clown weeps.", the POS tagger produces the (erroneous) "The_DT clown_NN weeps_NNS ._.".

However, my application will try to parse the result, and may reject a POS tagging because there is no way to parse it. Hence, in this example, it would reject "The_DT clown_NN weeps_NNS ._." but would accept "The_DT clown_NN weeps_VBZ ._.", which I assume the tagger considers a lower-confidence hypothesis.

I would therefore like the POS tagger to provide multiple hypotheses for the tagging of each word, annotated by some kind of confidence value. In this way, my application could choose the POS tagging with highest confidence that achieves a valid parsing for its purposes.

I have found no way to ask the Stanford POS Tagger to produce multiple (n-best) tagging hypotheses for each word (or even for the whole sentence). Is there a way to do this? (Alternatively, I am also OK with using another POS tagger with comparable performance that would have support for this.)

asked May 28 '13 by a3nm



1 Answer

OpenNLP supports retrieving the n-best tag sequences for POS tagging. From its documentation:

Some applications need to retrieve the n-best POS tag sequences and not only the best one. The topKSequences method returns the top sequences and can be called in a similar way to tag.

Sequence topSequences[] = tagger.topKSequences(sent);

Each Sequence object contains one tag sequence. The tags can be retrieved via Sequence.getOutcomes(), and Sequence.getProbs() returns the per-token probability array for that sequence.
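
For illustration, here is a minimal end-to-end sketch of that API. The model path en-pos-maxent.bin and the example sentence are placeholders; substitute your own pre-trained model file.

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.List;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.util.Sequence;

public class NBestTagging {
    public static void main(String[] args) throws Exception {
        // Load a pre-trained POS model; the path is a placeholder.
        try (InputStream in = new FileInputStream("en-pos-maxent.bin")) {
            POSTaggerME tagger = new POSTaggerME(new POSModel(in));

            String[] sent = {"The", "clown", "weeps", "."};

            // n-best tag sequences instead of only the single best one.
            Sequence[] topSequences = tagger.topKSequences(sent);

            for (Sequence seq : topSequences) {
                List<String> tags = seq.getOutcomes(); // one tag per token
                double[] probs = seq.getProbs();       // per-token probabilities
                StringBuilder line = new StringBuilder();
                for (int i = 0; i < sent.length; i++) {
                    line.append(String.format("%s_%s (%.3f) ",
                            sent[i], tags.get(i), probs[i]));
                }
                System.out.println(line);
            }
        }
    }
}

Your application can then walk the sequences in order and keep the first tagging it manages to parse.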

There is also a way to make spaCy do something similar (note that this targets the spaCy v2 API):

import numpy

from spacy.language import Language
from spacy.pipeline import Tagger
from spacy.tokens import Doc, Token

# Store the per-token tag probability matrix on the Doc and expose each
# token's row of scores through a Token extension.
Doc.set_extension('tag_scores', default=None)
Token.set_extension('tag_scores', getter=lambda token: token.doc._.tag_scores[token.i])

class ProbabilityTagger(Tagger):
    def predict(self, docs):
        tokvecs = self.model.tok2vec(docs)
        scores = self.model.softmax(tokvecs)
        guesses = []
        for i, doc_scores in enumerate(scores):
            # Keep the full softmax distribution instead of discarding it.
            docs[i]._.tag_scores = doc_scores
            doc_guesses = doc_scores.argmax(axis=1)

            # On GPU the argmax result is a cupy array; copy it to the host.
            if not isinstance(doc_guesses, numpy.ndarray):
                doc_guesses = doc_guesses.get()
            guesses.append(doc_guesses)
        return guesses, tokvecs


# Register this subclass as the pipeline's tagger factory.
Language.factories['tagger'] = lambda nlp, **cfg: ProbabilityTagger(nlp.vocab, **cfg)

Then each token will have tag_scores with the probabilities for each part of speech from spaCy's tag map.

Source: https://github.com/explosion/spaCy/issues/2087

answered Oct 20 '22 by Anastasiia Iurshina