 

Arabic lemmatization and Stanford NLP

I am trying to do lemmatization, i.e., identify the lemma and possibly the Arabic root of a verb, for example: يتصل ==> lemma (infinitive of the verb) ==> اتصل ==> root (triliteral root / Jidr thoulathi) ==> و ص ل

Do you think Stanford NLP can do that?

Best Regards,

Riadh Belkebir asked Mar 19 '15 17:03

1 Answer

The Stanford Arabic segmenter can't do true lemmatization. However, it is possible to train a new model to do something like stemming:

  • تكتبون ← ت+ كتب +ون
  • يتصل ← ي+ تصل
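
For reference, here is a minimal sketch of how the downloadable segmenter is typically run on a file of raw Arabic text. The class name is the one documented for the segmenter; the jar name, model path, and input file below are placeholder assumptions to be replaced with your local paths:

    import subprocess

    # Invoke the downloadable Stanford Arabic segmenter on raw Arabic text.
    # Jar name, model path, and file names are placeholders (assumptions).
    cmd = [
        "java", "-mx1g",
        "-cp", "stanford-segmenter.jar",
        "edu.stanford.nlp.international.arabic.process.ArabicSegmenter",
        "-loadClassifier", "data/arabic-segmenter-atbtrain.ser.gz",
        "-textFile", "input_arabic.txt",
    ]
    with open("input_arabic.segmented.txt", "w", encoding="utf-8") as out:
        subprocess.run(cmd, stdout=out, check=True)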

If it is very important that the output consist of real Arabic lemmas ("تصل" is not a true lemma), you might be better off with a tool like MADAMIRA (http://nlp.ldeo.columbia.edu/madamira/).

Elaboration: The Stanford Arabic segmenter produces its output character-by-character using only these operations (implemented in edu.stanford.nlp.international.arabic.process.IOBUtils):

  • Split a word between two characters
  • Transform lil- (للـ) into li+ al- (ل+ الـ)
  • Transform ta (ت) or ha (ه) into ta marbuta (ة)
  • Transform ya (ي) or alif (ا) into alif maqsura (ى)
  • Transform alif maqsura (ى) into ya (ي)

So lemmatizing يتصل to ي+ اتصل would require implementing an extra rule, namely inserting an alif after the ya or ta. Lemmatization of certain irregular forms would be completely impossible (for example, نساء ← امرأة).

The version of the Stanford segmenter available for download also only breaks off pronouns and particles:

وسيكتشفونه ← و+ س+ يكتشفون +ه

However, if you have access to the LDC Arabic Treebank or a similarly rich source of Arabic text with morphological segmentation annotated, it is possible to train your own model to remove all morphological affixes, which is closer to lemmatization:

وسيكتشفونه ← و+ س+ ي+ كتشف +ون +ه

Note that "كتشف" is not a real Arabic word, but the segmenter should at least consistently produce "كتشف" for تكتشفين ,أكتشف ,يكتشف, etc. If this is acceptable, you would need to change the ATB preprocessing script to instead use the morphological segmentation annotations. You could do this by replacing the script called parse_integrated with a modified version like this: https://gist.github.com/futurulus/38307d98992e7fdeec0d

Then follow the instructions for "TRAINING THE SEGMENTER" in the README.
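
As a rough sketch of that training step, the command below follows the usual Stanford classifier conventions (-trainFile, -serializeTo); the exact option names and all file paths here are assumptions, so defer to the README for the authoritative invocation:

    import subprocess

    # Train a new segmenter model on ATB data preprocessed to keep the
    # morphological segmentation annotations. Flag names follow common
    # Stanford classifier conventions; file names are placeholders.
    train_cmd = [
        "java", "-mx4g",
        "-cp", "stanford-segmenter.jar",
        "edu.stanford.nlp.international.arabic.process.ArabicSegmenter",
        "-trainFile", "atb_morph_segmented.txt",
        "-serializeTo", "arabic-morph-segmenter.ser.gz",
    ]
    subprocess.run(train_cmd, check=True)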

futurulus answered Sep 28 '22 16:09