Python ISRIStemmer for Arabic text

I am running the following code in IDLE (Python). I want to enter an Arabic string and get its stem, but it doesn't work:

>>> from nltk.stem.isri import ISRIStemmer
>>> st = ISRIStemmer()
>>> w= 'حركات'
>>> join = w.decode('Windows-1256')
>>> print st.stem(join).encode('Windows-1256').decode('utf-8')

The result is the same text as in w, 'حركات', which is not the stem.

But when I do the following:

>>> print st.stem(u'اعلاميون')

the call succeeds and returns the stem, 'علم'.

Why does passing some words to the stem() function not return the stem?

user2822966 asked Dec 06 '22 03:12


2 Answers

The code above won't work in Python 3: a str is already Unicode text and has no decode() method (only bytes objects can be decoded). So there is no need to decode anything; pass the string to the stemmer directly.

Here is a version that should work in Python 3:

from nltk.stem.isri import ISRIStemmer

st = ISRIStemmer()
w = 'حركات'  # a Python 3 str literal is already Unicode
print(st.stem(w))
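
If the word might arrive either as bytes (for example, read from a file opened in binary mode) or as str, a small guard can decode only when needed. This is a minimal sketch under that assumption; `ensure_text` is a hypothetical helper, not part of NLTK, and UTF-8 is assumed as the byte encoding. Its result is what you would pass to `st.stem(...)`.

```python
def ensure_text(value, encoding="utf-8"):
    """Return value as str, decoding only when it is bytes (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


word = "حركات"              # already Unicode text in Python 3
raw = word.encode("utf-8")  # the same word as UTF-8 bytes

# str has no decode() in Python 3, which is why the question's code fails there.
print(hasattr(word, "decode"))

# Both forms normalize to the same str.
print(ensure_text(raw) == word)
print(ensure_text(word) == word)
```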
MZe answered Dec 23 '22 13:12


OK, I solved the problem myself (on Python 2) by decoding the input as UTF-8 instead of Windows-1256:

w = 'حركات' 
st.stem(w.decode('utf-8'))

This correctly gives the root, "حرك".
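
A likely explanation for why the choice of codec matters, assuming the console passed the word to Python 2 as UTF-8 bytes: decoding those bytes as Windows-1256 yields different characters (mojibake), so the stemmer presumably never saw the intended word, and encoding back to Windows-1256 then decoding as UTF-8 simply restored the original text. The sketch below reproduces only the encoding part in Python 3, without NLTK.

```python
word = "حركات"
utf8_bytes = word.encode("utf-8")     # assumed form of the input in the question

wrong = utf8_bytes.decode("cp1256")   # the question's codec (Windows-1256)
right = utf8_bytes.decode("utf-8")    # the codec used in the fix

print(repr(wrong))                    # not the original word
print(wrong == word)
print(right == word)

# Reversing the wrong decode restores the original bytes, which is why the
# question's code printed the input word unchanged.
round_trip = wrong.encode("cp1256").decode("utf-8")
print(round_trip == word)
```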

user2822966 answered Dec 23 '22 14:12