Multilingual NLTK for POS Tagging and Lemmatizer

I recently started working with NLP and tried using NLTK and TextBlob to analyze texts. I would like to develop an app that analyzes reviews written by travelers, so I have to handle a lot of text written in different languages. I need to perform two main operations: POS tagging and lemmatization. I have seen that in NLTK it is possible to choose the right language for sentence tokenization like this:

tokenizer = nltk.data.load('tokenizers/punkt/PY3/italian.pickle')

I haven't yet found the right way to set the language for POS tagging and lemmatization. How can I set the correct corpus or dictionary for non-English texts such as Italian, French, Spanish or German? I also see that it is possible to import the "TreeBank" or "WordNet" modules, but I don't understand how to use them. Otherwise, where can I find the respective corpora?

Can you give me some suggestions or references? Please bear in mind that I'm not an NLTK expert.

Many thanks.

asked Sep 23 '15 13:09

Alessio Schiavelli

1 Answer

If you are looking for another multilingual POS tagger, you might want to try RDRPOSTagger, a robust, easy-to-use, language-independent toolkit for POS and morphological tagging. See this paper for experimental results on 13 languages, including tagging speed and accuracy. RDRPOSTagger now supports pre-trained POS and morphological tagging models for Bulgarian, Czech, Dutch, English, French, German, Hindi, Italian, Portuguese, Spanish, Swedish, Thai and Vietnamese. It also supports pre-trained Universal POS tagging models for 40 languages.

With the Python implementation, you can use the pre-trained models to tag a raw, unlabeled text corpus from the command line:

python RDRPOSTagger.py tag PATH-TO-PRETRAINED-MODEL PATH-TO-LEXICON PATH-TO-RAW-TEXT-CORPUS

Example: python RDRPOSTagger.py tag ../Models/POS/German.RDR ../Models/POS/German.DICT ../data/GermanRawTest
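Since the question involves several languages, the command above can be generated per language. The following is only a minimal sketch: it assumes every pre-trained model follows the `<Language>.RDR` / `<Language>.DICT` naming seen in the German example, and the helper names `build_tag_command` and `tag_corpus` are hypothetical, not part of RDRPOSTagger.

```python
import subprocess
from pathlib import PurePosixPath


def build_tag_command(language, corpus_path, models_dir="../Models/POS"):
    """Build the argument list for the RDRPOSTagger.py 'tag' command.

    Assumption: model and lexicon files are named <Language>.RDR and
    <Language>.DICT inside models_dir, as in the German example.
    """
    models = PurePosixPath(models_dir)
    return [
        "python",
        "RDRPOSTagger.py",
        "tag",
        str(models / f"{language}.RDR"),   # pre-trained model
        str(models / f"{language}.DICT"),  # lexicon
        str(corpus_path),                  # raw text corpus to tag
    ]


def tag_corpus(language, corpus_path, models_dir="../Models/POS"):
    """Run the tagger; call this from the directory containing RDRPOSTagger.py."""
    subprocess.run(build_tag_command(language, corpus_path, models_dir), check=True)
```

Calling `build_tag_command("German", "../data/GermanRawTest")` reproduces the arguments of the example command, and `tag_corpus` could then be looped over a list of (language, corpus) pairs.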

If you would like to call RDRPOSTagger from your own code, follow lines 92-98 of the RDRPOSTagger.py module in the pSCRDRTagger package. Here is an example:

# RDRPOSTagger and readDictionary must already be in scope, as they are in RDRPOSTagger.py
r = RDRPOSTagger()
r.constructSCRDRtreeFromRDRfile("../Models/POS/German.RDR")  # Load POS tagging model for German
DICT = readDictionary("../Models/POS/German.DICT")  # Load a German lexicon
r.tagRawSentence(DICT, "Die Reaktion des deutschen Außenministers zeige , daß dieser die außerordentlich wichtige Rolle Irans in der islamischen Welt erkenne .")

r = RDRPOSTagger()
r.constructSCRDRtreeFromRDRfile("../Models/POS/French.RDR") # Load POS tagging model for French
DICT = readDictionary("../Models/POS/French.DICT") # Load a French lexicon
r.tagRawSentence(DICT, "Cette annonce a fait l' effet d' une véritable bombe . ")
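For an app that receives reviews in several languages, it may help to load each model only once and reuse it. The sketch below is one possible arrangement, not part of RDRPOSTagger: `TaggerCache`, `split_tagged_sentence` and `load_rdr_tagger` are hypothetical names, and the code assumes `tagRawSentence` returns a whitespace-separated string of `word/TAG` tokens. `load_rdr_tagger` also assumes `RDRPOSTagger` and `readDictionary` are in scope, as in the snippets above.

```python
def split_tagged_sentence(tagged, sep="/"):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs.

    Splits on the last separator so that words containing '/' stay intact.
    Tokens without a separator get a tag of None.
    """
    pairs = []
    for token in tagged.split():
        word, found, tag = token.rpartition(sep)
        pairs.append((word, tag) if found else (token, None))
    return pairs


class TaggerCache:
    """Load one (tagger, lexicon) pair per language and reuse it."""

    def __init__(self, load_tagger):
        self._load_tagger = load_tagger  # callable: language -> (tagger, lexicon)
        self._loaded = {}

    def tag(self, language, sentence):
        if language not in self._loaded:
            self._loaded[language] = self._load_tagger(language)
        tagger, lexicon = self._loaded[language]
        return split_tagged_sentence(tagger.tagRawSentence(lexicon, sentence))


def load_rdr_tagger(language, models_dir="../Models/POS"):
    """Loader for TaggerCache, mirroring the snippets above.

    Assumption: RDRPOSTagger and readDictionary are already imported, and
    model files are named <Language>.RDR and <Language>.DICT.
    """
    tagger = RDRPOSTagger()
    tagger.constructSCRDRtreeFromRDRfile(f"{models_dir}/{language}.RDR")
    return tagger, readDictionary(f"{models_dir}/{language}.DICT")
```

With this in place, `TaggerCache(load_rdr_tagger).tag("French", sentence)` would load the French model on first use and reuse it for later sentences. Detecting which language a review is written in is a separate step that this sketch does not cover.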

answered Sep 18 '22 06:09

NQD

Tags: python, nlp, nltk, lemmatization, pos-tagger
