How to retrieve only Arabic text from a string using a regular expression?

Question

I have a string that contains both Arabic and English sentences. I want to extract only the Arabic sentences.

my_string="""
What is the reason
ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
behind this?
ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
"""

This link shows that the Unicode range for Arabic letters is 0600-06FF.

So the very basic attempt that came to mind is:

import re
print re.findall(r'[\u0600-\u06FF]+',my_string)

But this fails miserably, as it returns the following list:

['What', 'is', 'the', 'reason', 'behind', 'this?']

As you can see, this is exactly the opposite of what I want. What am I missing here?

N.B.

I know I can match the Arabic letters by using inverse matching, like below:

print re.findall(r'[^a-zA-Z\s0-9]+',my_string)

But I don't want that.

styvane · Accepted Answer

You can use re.sub to replace ASCII letters (and the question mark) with an empty string.

>>> my_string="""
... What is the reason
... ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
... behind this?
... ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
... """
>>> print(re.sub(r'[a-zA-Z?]', '', my_string).strip())
ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ

ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ

Your regex didn't work because you are using Python 2 and my_string is a str (a byte string). You need to convert my_string to unicode for the pattern to work; on Python 3.x it works as written. In Python 2:

>>> print "".join(re.findall(ur'[\u0600-\u06FF]', unicode(my_string, "utf-8"), re.UNICODE))
ذَلِكَالْكِتَابُلَارَيْبَفِيهِهُدًىلِلْمُتَّقِينَذَلِكَالْكِتَابُلَارَيْبَفِيهِهُدًىلِلْمُتَّقِينَ
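Since the question asks for whole Arabic sentences rather than single characters, the following is a minimal Python 3 sketch that keeps the spaces inside each Arabic line. It assumes Python 3, where str is already Unicode; the names ARABIC_RUN and arabic_sentences are illustrative only and do not come from the answer above.

```python
import re

my_string = """
What is the reason
ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
behind this?
ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
"""

# Runs of characters from the Arabic block (U+0600 to U+06FF), joined by
# spaces or tabs. Newlines are not allowed as separators, so each match
# stays on a single line.
ARABIC_RUN = re.compile(r'[\u0600-\u06FF]+(?:[ \t]+[\u0600-\u06FF]+)*')

arabic_sentences = ARABIC_RUN.findall(my_string)
for sentence in arabic_sentences:
    print(sentence)
```

Note that U+0600 to U+06FF covers only the basic Arabic block; text that uses other blocks (for example Arabic Supplement or the presentation forms) would need a wider character class.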

Roland Illig · Answer

Your code is:

print re.findall(r'[\u0600-\u06FF]+',my_string)

In Python 2, my_string is a str, that is, a byte sequence. When matching a byte sequence, there is no such concept as Unicode code points. Therefore, the \u escape sequences in the regular expression don’t make any sense. They are not interpreted as code points, as you expected; each \u simply means the letter u.

So when parsing the regular expression for bytes, it is equivalent to:

print re.findall(r'[u0600-u06FF]+',my_string)

This character class is interpreted as “one of u060, or a byte in the range 0-u, or one of 06FF”. This, in turn, is equivalent to [0-u], since all the other bytes are already included in this range.

print re.findall(r'[0-u]+', my_string)

Demonstration:

my_string = "What is thizz?"
print re.findall(r'[\u0600-\u06FF]+',my_string)
['What', 'is', 'thi', '?']

Note that the zz is not matched, since z comes after u in the ASCII character set.
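The equivalence can be checked with a short sketch. It is written for Python 3 and spells the character class out literally, because Python 3 rejects \u inside a bytes pattern instead of silently reading it as u. The variable names are illustrative.

```python
import re

sample = "What is thizz?"

# What the byte-oriented engine effectively sees once \u is read as a plain "u".
literal_class = re.findall(r'[u0600-u06FF]+', sample)
# The same class reduced to the single range that covers everything in it.
reduced_class = re.findall(r'[0-u]+', sample)

print(literal_class)
print(reduced_class)

# UTF-8 bytes of Arabic letters are all 0x80 or above, so they fall outside
# the ASCII range '0' (0x30) to 'u' (0x75) and can never be matched by it.
arabic_bytes = "ذلك".encode("utf-8")
print([hex(b) for b in arabic_bytes])
```

This is also why the original attempt returned only the English words: every byte of the Arabic text lies outside [0-u], while most ASCII letters, digits, and the question mark lie inside it.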

Iron Fist · Answer

Your original code was almost correct. Since you are working with Python 2, you just needed to decode my_string from its encoding, 'utf-8', and add the u prefix to your re pattern:

>>> for x in re.findall(ur'[\u0600-\u06FF]+', my_string.decode('utf-8')):
...     print x
...
ذَلِكَ
الْكِتَابُ
لَا
رَيْبَ
فِيهِ
هُدًى
لِلْمُتَّقِينَ
ذَلِكَ
الْكِتَابُ
لَا
رَيْبَ
فِيهِ
هُدًى
لِلْمُتَّقِينَ

This gives you a list of matched unicode strings (whole words) instead of single characters, so you don't need to join them back together with ''.join.

In Python 3 you don't need any of this encoding tweaking, because str is already Unicode text and the default source encoding is 'utf-8':

>>> for x in re.findall(r'[\u0600-\u06FF]+', my_string):
...     print(x)
...
ذَلِكَ
الْكِتَابُ
لَا
رَيْبَ
فِيهِ
هُدًى
لِلْمُتَّقِينَ
ذَلِكَ
الْكِتَابُ
لَا
رَيْبَ
فِيهِ
هُدًى
لِلْمُتَّقِينَ
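If the text arrives as encoded bytes in Python 3 (for example, read from a file opened in binary mode), the same decode-first idea applies. The helper below is a hypothetical sketch, not part of the answer; the function name arabic_words and its encoding parameter are assumptions made for illustration.

```python
import re

# Runs of characters from the Arabic block (U+0600 to U+06FF).
ARABIC_WORD = re.compile(r'[\u0600-\u06FF]+')


def arabic_words(data, encoding="utf-8"):
    """Return runs of Arabic-block characters from a str or from encoded bytes."""
    if isinstance(data, bytes):
        # Decode first so the pattern is matched against code points, not bytes.
        data = data.decode(encoding)
    return ARABIC_WORD.findall(data)


raw = "behind this? لَا رَيْبَ فِيهِ".encode("utf-8")
words = arabic_words(raw)
for word in words:
    print(word)
```

Passing an already decoded str gives the same result, so callers do not need to know which form they hold.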

Tags: python, string, regex, unicode, python-2.7

Ahsanul Haque
