
Python 3: Remove Arabic punctuation

I'm working with Arabic text and want to remove the Arabic punctuation. Example:

s="أهلاً بالعالم في هذه التجربة ! علامات ،الترقيم ؟ ,? لا .اتذكرها"

I also want "؟" and "،" removed from the output, because when I use:

import string
tr = str.maketrans("", "", string.punctuation)
s.translate(tr)

the output was 'أهلاً بالعالم في هذه التجربة علامات ،الترقيم ؟ لا اتذكرها'

Noura asked Dec 14 '22 18:12


1 Answer

The string.punctuation constant contains only the punctuation characters defined in ASCII. That does not even cover all the signs used with the Latin script (e.g. "fancy quotes" like «» are missing), let alone Arabic punctuation such as "؟" and "،".
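
As a quick check of that claim, here is a minimal sketch that tests a handful of characters against string.punctuation. The list of sample characters is just an illustration picked from the question, not an exhaustive set:

```python
import string

# Sample characters: four ASCII marks, two Arabic marks, two guillemets.
samples = ("!", "?", ",", ".", "؟", "،", "«", "»")

for ch in samples:
    # Print whether each character is covered by the ASCII-only constant.
    print(repr(ch), ch in string.punctuation)

# Every character in string.punctuation is plain ASCII.
ascii_only = all(ord(c) < 128 for c in string.punctuation)
print("ASCII only:", ascii_only)
```

The ASCII marks are found in the constant, while the Arabic question mark, the Arabic comma and the guillemets are not, which is why the translate() call in the question leaves them in place.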

If you don't want to create a list of all punctuation characters yourself (I wouldn't), you can use the Unicode character property to decide if a character is punctuation or not. The built-in unicodedata module gives you access to this information:

>>> import unicodedata as ud
>>> for c in 'abc: قيم ؟':
...     print(c, ud.category(c))
...
a Ll
b Ll
c Ll
: Po
  Zs
ق Lo
ي Lo
م Lo
  Zs
؟ Po

All categories are two-letter codes, like "Ll" for "letter, lowercase" or "Po" for "punctuation, other". All punctuation characters have a category that starts with "P".
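
The "starts with P" rule can be wrapped in a small helper. The name is_punctuation is my own choice here, not something the unicodedata module provides:

```python
import unicodedata as ud

def is_punctuation(ch):
    """Return True if the single character ch has a Unicode category starting with 'P'."""
    return ud.category(ch).startswith("P")

# Arabic marks, guillemets, some ASCII punctuation, and two ASCII symbols.
for ch in ("؟", "،", "؛", "«", "»", "(", "-", "_", "$", "+"):
    print(repr(ch), ud.category(ch), is_punctuation(ch))
```

One caveat worth knowing: a few characters in string.punctuation, such as "$" and "+", are classified by Unicode as symbols (categories starting with "S"), not punctuation. If you want those removed too, you would probably need to test for "S" categories as well.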

You can use this information to filter out punctuation characters (e.g. with a generator expression):

>>> s = "أهلاً بالعالم في هذه التجربة ! علامات ،الترقيم ؟ ,? لا .اتذكرها"
>>> ''.join(c for c in s if not ud.category(c).startswith('P'))
'أهلاً بالعالم في هذه التجربة  علامات الترقيم   لا اتذكرها'
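
If you process a lot of text, a variation on the same idea is to build a translation table once and reuse it with str.translate. The sketch below also collapses the runs of spaces left behind, which the generator expression above does not do. The names PUNCT_TABLE and strip_punctuation are made up for this example:

```python
import sys
import unicodedata as ud

# Map every code point whose category starts with "P" to None,
# so that str.translate deletes it. Built once, reused for every string.
PUNCT_TABLE = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if ud.category(chr(cp)).startswith("P")
}

def strip_punctuation(text):
    """Delete punctuation, then collapse the whitespace runs left behind."""
    return " ".join(text.translate(PUNCT_TABLE).split())

s = "أهلاً بالعالم في هذه التجربة ! علامات ،الترقيم ؟ ,? لا .اتذكرها"
print(strip_punctuation(s))
```

Building the table scans the whole Unicode range, so it takes a moment at import time. Note that split() and join() also normalize tabs and newlines to single spaces, which may or may not be what you want.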
lenz answered Dec 18 '22 00:12