I am trying to learn how to tag Spanish words using NLTK.
From the NLTK book, it is quite easy to tag English words using their example. Because I am new to NLTK and to language processing in general, I am quite confused about how to proceed.
I have downloaded the cess_esp corpus. Is there a way to specify a corpus in nltk.pos_tag? I looked at the pos_tag documentation and didn't see anything that suggested I could. I feel like I'm missing some key concepts. Do I have to manually tag the words in my text against the cess_esp corpus? (By manually I mean tokenize my sentence and run it against the corpus.) Or am I off the mark entirely? Thank you.
POS tagging in NLTK is the process of marking up each word in a text with its part of speech, based on the word's definition and context. Some examples of NLTK POS tags are: CC, CD, EX, JJ, MD, NNP, PDT, PRP$, TO, etc. A POS tagger assigns grammatical information to each word of a sentence.
The nltk.corpus package defines a collection of corpus reader classes, which can be used to access the contents of a diverse set of corpora. The list of available corpora is given at: https://www.nltk.org/nltk_data/ Each corpus reader class is specialized to handle a specific corpus format.
The text is first split into tokens with the word_tokenize() function. Then the tokens are POS tagged with the pos_tag() function.
Corpora are collections of text documents; a single collection is called a corpus. One famous corpus is the Gutenberg Corpus, which contains some 25,000 free electronic books, hosted at http://www.gutenberg.org/.
First you need to read the tagged sentences from a corpus. NLTK provides a nice interface so you don't have to bother with the different formats of the different corpora; you can simply import the corpus and use the corpus object's functions to access the data. See http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml .
Then you have to choose a tagger and train it. There are fancier options, but you can start with the N-gram taggers.
Then you can use the tagger to tag the sentence you want. Here's some example code:
from nltk.corpus import cess_esp as cess
from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt
# Read the corpus into a list,
# each entry in the list is one sentence.
cess_sents = cess.tagged_sents()
# Train the unigram tagger
uni_tag = ut(cess_sents)
sentence = "Hola , esta foo bar ."
# Tagger reads a list of tokens.
uni_tag.tag(sentence.split(" "))
# Split the corpus into a training set and a testing set.
train = int(len(cess_sents)*90/100) # 90%
# Train a bigram tagger with only the training data.
bi_tag = bt(cess_sents[:train])
# Evaluate on the remaining 10% (slice from train, not train+1,
# so no sentence is skipped). Newer NLTK releases prefer accuracy().
bi_tag.evaluate(cess_sents[train:])
# Using the tagger.
bi_tag.tag(sentence.split(" "))
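A bigram tagger used on its own returns None for any word it has not seen in a matching context, so in practice the N-gram taggers are usually chained with backoff. The following is a minimal sketch of that idea, not part of the original answer. To keep it runnable without downloading anything, it trains on a tiny hand-made list called toy_sents (a hypothetical stand-in for cess.tagged_sents()), and the tag names DET, NOUN, VERB, ADJ, PUNCT and UNK are illustrative placeholders, not the real cess_esp tag set.

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

# Hypothetical stand-in for cess.tagged_sents(): a list of sentences,
# each sentence a list of (word, tag) pairs. Tags are placeholders.
toy_sents = [
    [("el", "DET"), ("gato", "NOUN"), ("come", "VERB"), (".", "PUNCT")],
    [("la", "DET"), ("casa", "NOUN"), ("es", "VERB"), ("grande", "ADJ"), (".", "PUNCT")],
    [("el", "DET"), ("perro", "NOUN"), ("duerme", "VERB"), (".", "PUNCT")],
]

# Backoff chain: bigram tagger -> unigram tagger -> fixed default tag.
default_tag = DefaultTagger("UNK")
uni_tag = UnigramTagger(toy_sents, backoff=default_tag)
bi_tag = BigramTagger(toy_sents, backoff=uni_tag)

# "foo" never appears in the training data, so it falls through
# to the default tagger instead of being tagged None.
tagged = bi_tag.tag("el gato duerme foo .".split())
print(tagged)
```

With the real corpus you would pass cess_sents[:train] in place of toy_sents; the default tag to use for unknown words is a design choice that depends on the corpus tag set.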
Training a tagger on a large corpus may take a significant time. Instead of training a tagger every time we need one, it is convenient to save a trained tagger in a file for later re-use.
Please look at the Storing Taggers section in http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html
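As a rough sketch of what saving and reloading can look like, the standard pickle module is one option. The file name spanish_unigram_tagger.pkl and the toy training data below are assumptions made for illustration; with the real corpus you would train on cess.tagged_sents() instead.

```python
import os
import pickle
import tempfile

from nltk.tag import UnigramTagger

# Hypothetical toy training data standing in for cess.tagged_sents();
# the tags are placeholders, not the real cess_esp tag set.
toy_sents = [
    [("el", "DET"), ("gato", "NOUN"), ("come", "VERB"), (".", "PUNCT")],
    [("la", "DET"), ("casa", "NOUN"), ("es", "VERB"), ("grande", "ADJ"), (".", "PUNCT")],
]
tagger = UnigramTagger(toy_sents)

# Save the trained tagger to a file (the file name is an arbitrary choice).
path = os.path.join(tempfile.gettempdir(), "spanish_unigram_tagger.pkl")
with open(path, "wb") as f:
    pickle.dump(tagger, f, protocol=pickle.HIGHEST_PROTOCOL)

# Later, possibly in another process: load it back and use it
# without retraining.
with open(path, "rb") as f:
    loaded_tagger = pickle.load(f)

loaded = loaded_tagger.tag(["el", "gato"])
print(loaded)
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.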