NLTK: Tagging Spanish words using a corpus

Tags:

nltk

I am trying to learn how to tag Spanish words using NLTK.

From the NLTK book, it is quite easy to tag English words using their example. Because I am new to NLTK and to language processing in general, I am quite confused about how to proceed.

I have downloaded the cess_esp corpus. Is there a way to specify a corpus in nltk.pos_tag? I looked at the pos_tag documentation and didn't see anything that suggested I could. I feel like I'm missing some key concepts. Do I have to manually tag the words in my text against the cess_esp corpus? (By manually I mean tokenize my sentence and run it against the corpus.) Or am I off the mark entirely? Thank you.

asked Feb 06 '13 15:02

dm03514

1 Answer

First you need to read the tagged sentences from a corpus. NLTK provides a uniform interface so that you do not have to bother with the different file formats of the different corpora; you simply import the corpus and use the corpus object's functions to access the data. See http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml .

Then you have to choose a tagger and train it on those tagged sentences. There are fancier options, but you can start with the N-gram taggers.

Then you can use the trained tagger to tag the sentence you want. Here is some example code:

from nltk.corpus import cess_esp as cess
from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt

# Read the corpus into a list-like object;
# each entry is one sentence, given as a list of (word, tag) pairs.
cess_sents = cess.tagged_sents()

# Train the unigram tagger on the whole corpus.
uni_tag = ut(cess_sents)

sentence = "Hola , esta foo bar ."

# The tagger takes a list of tokens, not a raw string.
# Words never seen in training are tagged None.
print(uni_tag.tag(sentence.split(" ")))

# Split the corpus into training and testing sets.
train = int(len(cess_sents)*90/100) # index where the first 90% ends

# Train a bigram tagger with only the training data.
bi_tag = bt(cess_sents[:train])

# Evaluate on the remaining 10% (start at train, not train+1,
# so that no sentence is skipped).
# Newer NLTK releases deprecate evaluate() in favour of accuracy().
print(bi_tag.evaluate(cess_sents[train:]))

# Use the tagger.
print(bi_tag.tag(sentence.split(" ")))
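
A bigram tagger trained on its own tags a word None whenever it has not seen that word in that tag context, and every following word then tends to get None as well. A common remedy is to chain taggers with the backoff argument. The following is only a minimal sketch of that idea: the tiny train_sents list and its DET/NOUN/VERB tags are made-up stand-ins so that it runs without the corpus; in practice you would pass cess.tagged_sents() and use a default tag taken from the cess_esp tagset.

```python
from nltk import BigramTagger, DefaultTagger, UnigramTagger

# Toy training data standing in for cess.tagged_sents().
# The tags are simplified placeholders, not the real cess_esp tagset.
train_sents = [
    [("el", "DET"), ("gato", "NOUN"), ("come", "VERB"), (".", "PUNCT")],
    [("la", "DET"), ("casa", "NOUN"), ("es", "VERB"), ("grande", "ADJ"), (".", "PUNCT")],
]

tokens = "el perro come .".split()  # "perro" never appears in the training data

# A bigram tagger with no backoff: unknown contexts are tagged None.
plain_bigram = BigramTagger(train_sents)
plain_tags = plain_bigram.tag(tokens)

# Backoff chain: bigram -> unigram -> default tag for anything still unknown.
default_tagger = DefaultTagger("NOUN")
unigram_tagger = UnigramTagger(train_sents, backoff=default_tagger)
bigram_tagger = BigramTagger(train_sents, backoff=unigram_tagger)
backoff_tags = bigram_tagger.tag(tokens)

print(plain_tags)
print(backoff_tags)
```

With the chain in place, every token receives some tag, and the more specific tagger is consulted first.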

Training a tagger on a large corpus may take a significant time. Instead of training a tagger every time we need one, it is convenient to save a trained tagger in a file for later re-use.

See the "Storing Taggers" section of http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html
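
As a rough illustration of storing a tagger, the sketch below saves a trained tagger with Python's standard pickle module and loads it back. The toy train_sents data, its placeholder tags, and the file name spanish_unigram_tagger.pkl are assumptions made so that the example runs on its own; substitute the tagger you trained on cess_esp and a path of your choice.

```python
import pickle
import tempfile
from pathlib import Path

from nltk import UnigramTagger

# Toy training data standing in for cess.tagged_sents() (placeholder tags).
train_sents = [
    [("la", "DET"), ("casa", "NOUN"), ("es", "VERB"), ("grande", "ADJ"), (".", "PUNCT")],
]
tagger = UnigramTagger(train_sents)

# Hypothetical file name; written to the system temp directory for this demo.
model_path = Path(tempfile.gettempdir()) / "spanish_unigram_tagger.pkl"

# Save the trained tagger once...
with open(model_path, "wb") as f:
    pickle.dump(tagger, f)

# ...and load it later without retraining.
with open(model_path, "rb") as f:
    loaded_tagger = pickle.load(f)

loaded_tags = loaded_tagger.tag("la casa".split())
print(loaded_tags)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.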

answered Sep 27 '22 20:09

alvas
