I want to filter my text by removing Arabic diacritics using Python.
For example:
| Context | Text |
|---|---|
| Before filtering | اللَّهمَّ اغْفِرْ لنَا ولوالدِينَا |
| After filtering | اللهم اغفر لنا ولوالدينا |
I have found that this can be done using CAMeL Tools but I am not sure how.
You can use the PyArabic library (`pip install pyarabic`) like this:
import pyarabic.araby as araby
before_filter = "اللَّهمَّ اغْفِرْ لنَا ولوالدِينَا"
after_filter = araby.strip_diacritics(before_filter)
print(after_filter)
# will print : اللهم اغفر لنا ولوالدينا
PyArabic also has narrower strip filters. The comment after each call shows what it returns for the same input:
araby.strip_harakat(before_filter) # 'اللّهمّ اغفر لنا ولوالدينا'
araby.strip_lastharaka(before_filter) # 'اللَّهمَّ اغْفِرْ لنَا ولوالدِينَا'
araby.strip_shadda(before_filter) # 'اللَهمَ اغْفِرْ لنَا ولوالدِينَا'
araby.strip_small(before_filter) # 'اللَّهمَّ اغْفِرْ لنَا ولوالدِينَا'
araby.strip_tashkeel(before_filter) # 'اللهم اغفر لنا ولوالدينا'
araby.strip_tatweel(before_filter) # 'اللَّهمَّ اغْفِرْ لنَا ولوالدِينَا'
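To see why these filters give different results, it helps to look at which code points each one targets. The sketch below is not PyArabic's implementation; it is a standard-library approximation, and the group names (`TANWIN`, `SHORT_VOWELS`, and so on) and the helper `strip_chars` are my own labels, chosen to match the outputs listed above.

```python
# Assumed grouping of the Arabic marks discussed above (illustrative names).
TANWIN = "\u064b\u064c\u064d"        # fathatan, dammatan, kasratan
SHORT_VOWELS = "\u064e\u064f\u0650"  # fatha, damma, kasra
SHADDA = "\u0651"                    # gemination mark
SUKUN = "\u0652"                     # "no vowel" mark
TATWEEL = "\u0640"                   # elongation character (kashida)


def strip_chars(text: str, chars: str) -> str:
    """Remove every occurrence of the given code points from text."""
    return text.translate({ord(c): None for c in chars})


# meem + fatha + dal + shadda + fatha
word = "\u0645\u064e\u062f\u0651\u064e"

# Only the shadda goes; the two fathas stay.
no_shadda = strip_chars(word, SHADDA)

# Vowel marks go; the shadda stays (similar in spirit to strip_harakat).
no_harakat = strip_chars(word, TANWIN + SHORT_VOWELS + SUKUN)

# Everything goes (similar in spirit to strip_tashkeel).
bare = strip_chars(word, TANWIN + SHORT_VOWELS + SUKUN + SHADDA)

print(no_shadda, no_harakat, bare)
```

Choosing which groups to pass to `strip_chars` is the whole design decision: keep the shadda if you still need to tell geminated consonants apart, drop everything if you only want the bare letters.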
You really don't need to use any library for this, just plain regex:
import re
text = 'اللَّهمَّ اغْفِرْ لنَا ولوالدِينَا '
# Remove tanwin, short vowels, shadda and sukun (U+064B to U+0652), tatweel (U+0640) and the U+FC62 shadda ligature
output = re.sub('[\u064b-\u0652\u0640\ufc62]', '', text)
print(output)
#اللهم اغفر لنا ولوالدينا
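If you would rather not list code points by hand, another standard-library option is to drop characters by Unicode category. This is a different technique from the regex above, and it is broader: it removes every nonspacing combining mark (category `Mn`), which covers the marks in the regex but also things like the superscript alef (U+0670), Quranic annotation signs, and combining accents in other scripts. The function name `strip_marks` is hypothetical; treat this as a sketch to adapt, not a drop-in replacement.

```python
import unicodedata

TATWEEL = "\u0640"  # category Lm, so it has to be handled separately


def strip_marks(text: str) -> str:
    """Drop all nonspacing combining marks (Mn) and the tatweel character."""
    return "".join(
        ch
        for ch in text
        if unicodedata.category(ch) != "Mn" and ch != TATWEEL
    )


sample = "اللَّهمَّ اغْفِرْ لنَا ولوالدِينَا"
print(strip_marks(sample))
```

The text is deliberately not normalized first. Running NFD normalization before filtering would decompose letters such as U+0623 (alef with hamza above) and then strip the hamza as well, which is usually not what you want.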