How to retrieve only arabic texts from a string using regular expression?

I have a string which has both Arabic and English sentences. What I want is to extract Arabic Sentences only.

my_string="""
What is the reason
ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
behind this?
ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
"""

This link shows that the Unicode range for Arabic letters is U+0600 to U+06FF.

So, the very basic attempt that came to my mind is:

import re
print re.findall(r'[\u0600-\u06FF]+',my_string)

But, this fails miserably as it returns the following list.

['What', 'is', 'the', 'reason', 'behind', 'this?']

As you can see, this is exactly the opposite of what I want. What am I missing here?

N.B.

I know I can match the Arabic letters by using inverse matching like below:

print re.findall(r'[^a-zA-Z\s0-9]+',my_string)

But, I don't want that.

Ahsanul Haque asked Apr 16 '16 08:04


3 Answers

You can use re.sub to replace the ASCII letters (and the question mark) with an empty string.

>>> my_string="""
... What is the reason
... ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
... behind this?
... ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
... """
>>> print(re.sub(r'[a-zA-Z?]', '', my_string).strip())
ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ

ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ

Your regex didn't work because you are using Python 2 and your string is a str, which is a byte string there. You need to convert my_string to unicode for it to work. On Python 3.x, however, it works as written.

>>> print "".join(re.findall(ur'[\u0600-\u06FF]', unicode(my_string, "utf-8"), re.UNICODE))
ذَلِكَالْكِتَابُلَارَيْبَفِيهِهُدًىلِلْمُتَّقِينَذَلِكَالْكِتَابُلَارَيْبَفِيهِهُدًىلِلْمُتَّقِينَ
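
For readers on Python 3, here is a minimal sketch of the same point. It assumes the text arrives as UTF-8 bytes (the names raw, text and ARABIC are only for illustration), which is roughly the situation a Python 2 str puts you in. Decoding first is what makes the \u range meaningful:

```python
import re

# Arabic block, U+0600 to U+06FF, as a text (str) pattern.
ARABIC = re.compile(r'[\u0600-\u06FF]+')

# Assumption: the input is UTF-8 encoded bytes, like a Python 2 str.
raw = "What is the reason\nذَلِكَ الْكِتَابُ\n".encode("utf-8")

# A text pattern cannot be applied to bytes in Python 3.
try:
    ARABIC.findall(raw)
    bytes_rejected = False
except TypeError:
    bytes_rejected = True

# Decode to text first, then match code points.
text = raw.decode("utf-8")
words = ARABIC.findall(text)

print(bytes_rejected)
for word in words:
    print(word)
```

Python 3 refuses to mix a str pattern with bytes instead of silently matching the wrong thing, so this kind of mistake shows up immediately.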
styvane answered Oct 13 '22 01:10


Your code is:

print re.findall(r'[\u0600-\u06FF]+',my_string)

When matching a byte sequence, there is no such concept as Unicode code points. Therefore, the \u escape sequences in the regular expression don’t make any sense. They are not interpreted as you thought, but just mean u.

So when parsing the regular expression for bytes, it is equivalent to:

print re.findall(r'[u0600-u06FF]+',my_string)

This character class is interpreted as “one of u060, or a byte in the range 0-u, or one of 06FF”. This, in turn, is equivalent to [0-u], since all the other bytes are already included in this range.

print re.findall(r'[0-u]+', my_string)

Demonstration:

my_string = "What is thizz?"
print re.findall(r'[\u0600-\u06FF]+',my_string)
['What', 'is', 'thi', '?']

Note that the zz is not matched, since z comes after u in the ASCII character set.
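
The equivalence can be checked on Python 3 with bytes patterns. This is only a sketch: as far as I know, recent Python 3 versions reject \u in a bytes pattern as a bad escape, so the pattern below spells out the literal u that Python 2 silently fell back to. The variable names are just for illustration.

```python
import re

sample = b"What is thizz?"

# What the byte-string regex effectively saw: \u read as a plain 'u'.
misread = re.findall(rb'[u0600-u06FF]+', sample)

# The same character class reduced to the one range that covers it.
reduced = re.findall(rb'[0-u]+', sample)

print(misread)
print(reduced)
print(misread == reduced)
```

Both calls return the same pieces, and neither includes the zz.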

Roland Illig answered Oct 13 '22 01:10


Your original code was almost correct. You just needed to decode my_string with the proper encoding, 'utf-8', and add the u prefix to your re pattern, since you are working with Python 2:

>>> for x in re.findall(ur'[\u0600-\u06FF]+', my_string.decode('utf-8')):
...     print x
...
ذَلِكَ
الْكِتَابُ
لَا
رَيْبَ
فِيهِ
هُدًى
لِلْمُتَّقِينَ
ذَلِكَ
الْكِتَابُ
لَا
رَيْبَ
فِيهِ
هُدًى
لِلْمُتَّقِينَ

This gives you a list of matched unicode strings (whole words) instead of single characters, so you don't need to join them back with ''.join.

If you are on Python 3, you don't need any of the encoding tweaking, since str is already Unicode text and the default source encoding is 'utf-8':

>>> for x in re.findall(r'[\u0600-\u06FF]+', my_string):
...     print(x)
...
ذَلِكَ
الْكِتَابُ
لَا
رَيْبَ
فِيهِ
هُدًى
لِلْمُتَّقِينَ
ذَلِكَ
الْكِتَابُ
لَا
رَيْبَ
فِيهِ
هُدًى
لِلْمُتَّقِينَ
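
Since the question asks for Arabic sentences rather than single words, one possible variation (a sketch for Python 3, not the only way to do it) is to let the pattern also span the spaces between Arabic words on the same line. The helper name extract_arabic_runs and the pattern name ARABIC_RUN are made up for this example:

```python
import re

# One Arabic-block word, optionally followed by more Arabic words
# separated by spaces or tabs. Newlines end the run.
ARABIC_RUN = re.compile(r'[\u0600-\u06FF]+(?:[ \t]+[\u0600-\u06FF]+)*')

def extract_arabic_runs(text):
    """Return each unbroken run of Arabic words as one string."""
    return ARABIC_RUN.findall(text)

my_string = """
What is the reason
ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
behind this?
ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
"""

arabic_runs = extract_arabic_runs(my_string)
for run in arabic_runs:
    print(run)
```

With the sample string from the question this should print the two Arabic lines whole, with the English lines left out. Note that a run is broken by anything outside U+0600 to U+06FF other than spaces and tabs, for example Latin punctuation in the middle of a sentence.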
Iron Fist answered Oct 13 '22 00:10