Python and NLTK: Baseline tagger

I am writing code for a baseline tagger that assigns each word its most common tag in the Brown corpus. So if the word "works" is tagged as a verb 23 times and as a plural noun 30 times, then in the user's input sentence it would be tagged as a plural noun. If a word is not found in the corpus, it is tagged as a noun by default. The code I have so far returns every tag for a word, not just the most frequent one. How can I get it to return only the most frequent tag per word?

import nltk 
from nltk.corpus import brown

def findtags(userinput, tagged_text):
    uinput = userinput.split()
    fdist = nltk.FreqDist(tagged_text)
    result = []
    for item in fdist.items():
        for u in uinput:
            if u==item[0][0]:
                t = (u,item[0][1])
                result.append(t)
        continue
        t = (u, "NN")
        result.append(t)
    return result

def main():
    tags = findtags("the quick brown fox", brown.tagged_words())
    print(tags)

if __name__ == '__main__':
    main()
Helena asked Apr 17 '26 22:04

1 Answer

If it's English, there is a default POS tagger in NLTK. A lot of people have complained about it, but it's a nice quick fix (more a band-aid than paracetamol); see POS tagging - NLTK thinks noun is adjective:

>>> from nltk.tag import pos_tag
>>> from nltk.tokenize import word_tokenize
>>> sent = "the quick brown fox"
>>> pos_tag(word_tokenize(sent))
[('the', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN')]

If you want to train a baseline tagger from scratch, I recommend you follow an example like this, but change the corpus to an English one: https://github.com/alvations/spaghetti-tagger

By building a UnigramTagger like in spaghetti-tagger, you should automatically achieve the most common tag for every word.
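A minimal sketch of that approach, using NLTK's UnigramTagger with a DefaultTagger backoff. The tiny hand-made training set below is an assumption standing in for brown.tagged_sents(), which works the same way but takes longer to train:

```python
from nltk.tag import UnigramTagger, DefaultTagger

# Tiny hand-made tagged corpus standing in for brown.tagged_sents().
train = [
    [("works", "NNS"), ("well", "RB")],
    [("it", "PPS"), ("works", "VBZ")],
    [("works", "NNS"), ("of", "IN"), ("art", "NN")],
]

# The unigram tagger memorizes the most frequent tag per word seen in
# training; DefaultTagger("NN") handles any word it has never seen.
tagger = UnigramTagger(train, backoff=DefaultTagger("NN"))

print(tagger.tag(["works", "fox"]))
# "works" was tagged NNS twice vs VBZ once -> NNS; "fox" is unseen -> NN
```

This gives exactly the behavior the question asks for: the most frequent tag per known word and "NN" for unknown words.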

However, if you want to do it without machine learning, the first step is to count word:POS pairs; what you need is a tally of tags per word type. See also Part-of-speech tag without context using nltk:

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from collections import Counter, defaultdict
from itertools import chain

def type_token_ratio(documentstream):
    ttr = defaultdict(list)
    for token, pos in list(chain(*documentstream)):
        ttr[token].append(pos)  
    return ttr

def most_freq_tag(ttr, word):
    return Counter(ttr[word]).most_common()[0][0]

sent1 = "the quick brown fox quick me with a quick ."
sent2 = "the brown quick fox fox me with a brown ." 
documents = [sent1, sent2]

# Calculates the TTR.
documents_ttr = type_token_ratio([pos_tag(word_tokenize(i)) for i in documents])

# Best tag for the word.
print(Counter(documents_ttr['quick']).most_common()[0])

# Best tags for a sentence.
print([most_freq_tag(documents_ttr, i) for i in sent1.split()])

NOTE: A document stream can be defined as a list of sentences, where each sentence is a list of tokens, with or without tags.
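The same word-to-tag tally can also be kept with NLTK's own ConditionalFreqDist, which counts (condition, sample) pairs; calling .max() on a condition returns its most frequent sample. A sketch with made-up counts, equivalent to the Counter approach above:

```python
from nltk.probability import ConditionalFreqDist

# (word, tag) pairs; ConditionalFreqDist tallies tags per word.
tagged = [("the", "DT"), ("quick", "JJ"), ("quick", "NN"), ("quick", "JJ")]
cfd = ConditionalFreqDist(tagged)

print(cfd["quick"].max())  # JJ: seen twice, vs NN once
```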

alvas answered Apr 19 '26 11:04


