
NLTK: Tagging Spanish words using a corpus

Tags: python, nltk

I am trying to learn how to tag Spanish words using NLTK.

According to the NLTK book, it is quite easy to tag English words using its example. Because I am new to NLTK and to language processing in general, I am quite confused about how to proceed.

I have downloaded the cess_esp corpus. Is there a way to specify a corpus in nltk.pos_tag? I looked at the pos_tag documentation and didn't see anything that suggested I could. I feel like I'm missing some key concepts. Do I have to manually tag the words in my text against the cess_esp corpus? (By manually I mean tokenize my sentence and run it against the corpus.) Or am I off the mark entirely? Thank you.

dm03514 asked Feb 06 '13 15:02



1 Answer

First, you need to read the tagged sentences from a corpus. NLTK provides a uniform interface so that you don't have to bother with the different formats of the different corpora; you can simply import the corpus and use the corpus object's functions to access the data. See http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml .

Then you have to choose a tagger and train it. There are fancier options, but you can start with the N-gram taggers.

Then you can use the tagger to tag the sentence you want. Here's some example code:

from nltk.corpus import cess_esp as cess
from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt

# Read the corpus into a list, 
# each entry in the list is one sentence.
cess_sents = cess.tagged_sents()

# Train the unigram tagger
uni_tag = ut(cess_sents)

sentence = "Hola , esta foo bar ."

# The tagger expects a list of tokens.
# Words it never saw in training are tagged None.
print(uni_tag.tag(sentence.split()))

# Split corpus into training and testing set.
train = int(len(cess_sents)*90/100) # 90%

# Train a bigram tagger with only training data.
bi_tag = bt(cess_sents[:train])

# Evaluate on the remaining 10% held out for testing.
# (Newer NLTK releases prefer bi_tag.accuracy(...) over evaluate().)
print(bi_tag.evaluate(cess_sents[train:]))

# Use the bigram tagger on the example sentence.
print(bi_tag.tag(sentence.split()))
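A bigram tagger on its own leaves many tokens untagged, because it only knows contexts it saw during training. One common way around this is to chain taggers with the backoff parameter, so each tagger falls back to a simpler one. The sketch below is only an illustration: toy_sents and its simplified tags are made-up stand-ins so the snippet runs without downloading anything. In practice you would pass cess.tagged_sents() (or your training slice of it) instead, and choose a default tag from the cess_esp tagset.

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

# Hypothetical toy training data standing in for cess.tagged_sents().
# The tags are simplified placeholders, NOT the real cess_esp tagset.
toy_sents = [
    [("el", "DET"), ("gato", "NOUN"), ("come", "VERB"), (".", "PUNCT")],
    [("la", "DET"), ("casa", "NOUN"), ("es", "VERB"), ("grande", "ADJ"), (".", "PUNCT")],
    [("el", "DET"), ("perro", "NOUN"), ("come", "VERB"), (".", "PUNCT")],
]

# Last resort: tag everything as a noun (an assumption made for this sketch).
default_tag = DefaultTagger("NOUN")

# The unigram tagger falls back to the default tagger for unknown words.
uni_tag = UnigramTagger(toy_sents, backoff=default_tag)

# The bigram tagger falls back to the unigram tagger for unknown contexts.
bi_tag = BigramTagger(toy_sents, backoff=uni_tag)

tagged = bi_tag.tag("el gato es grande .".split())
print(tagged)

# A word that never appeared in training still gets a tag via the default tagger.
unknown = bi_tag.tag(["zapato"])
print(unknown)
```

With this chain no token ends up tagged None, which is usually what you want before evaluating on held-out sentences.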

Training a tagger on a large corpus can take a significant amount of time. Instead of training a tagger every time you need one, it is convenient to save a trained tagger to a file for later reuse.

See the Storing Taggers section in http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html
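As a rough sketch of that idea, a trained tagger can be written to disk and read back with Python's pickle module. The file name and the toy training data here are made up for the example; with the real corpus you would train on cess.tagged_sents() once and then only load the pickle afterwards.

```python
import os
import pickle
import tempfile

from nltk.tag import UnigramTagger

# Hypothetical toy training data standing in for cess.tagged_sents().
# The tags are simplified placeholders, NOT the real cess_esp tagset.
toy_sents = [
    [("el", "DET"), ("gato", "NOUN"), ("come", "VERB"), (".", "PUNCT")],
    [("la", "DET"), ("casa", "NOUN"), ("es", "VERB"), ("grande", "ADJ"), (".", "PUNCT")],
]

tagger = UnigramTagger(toy_sents)

# Hypothetical file name; any writable path works.
path = os.path.join(tempfile.gettempdir(), "spanish_unigram_tagger.pkl")

# Save the trained tagger.
with open(path, "wb") as f:
    pickle.dump(tagger, f, protocol=pickle.HIGHEST_PROTOCOL)

# Later (for example in another script), load it back instead of retraining.
with open(path, "rb") as f:
    loaded = pickle.load(f)

tokens = "el gato come .".split()
print(loaded.tag(tokens))
```

Only load pickle files you created yourself, since unpickling runs arbitrary code from the file.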

alvas answered Sep 27 '22 20:09