I have a string that contains both Arabic and English sentences. I want to extract only the Arabic sentences.
my_string="""
What is the reason
ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
behind this?
ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
"""
This link shows that the Unicode range for Arabic letters is 0600-06FF.
So the very basic attempt that came to my mind is:
import re
print re.findall(r'[\u0600-\u06FF]+',my_string)
But, this fails miserably as it returns the following list.
['What', 'is', 'the', 'reason', 'behind', 'this?']
As you can see, this is exactly the opposite of what I want. What am I missing here?
N.B.
I know I can match the Arabic letters by using inverse matching like below:
print re.findall(r'[^a-zA-Z\s0-9]+',my_string)
But, I don't want that.
You can use re.sub to replace the ASCII letters (and the question mark) with an empty string.
>>> my_string="""
... What is the reason
... ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
... behind this?
... ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
... """
>>> print(re.sub(r'[a-zA-Z?]', '', my_string).strip())
ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
Your regex didn't work because you are using Python 2 and your string is a str (a byte string); you need to convert my_string to unicode for it to work. The same pattern works as-is on Python 3.x.
>>> print "".join(re.findall(ur'[\u0600-\u06FF]', unicode(my_string, "utf-8"), re.UNICODE))
ذَلِكَالْكِتَابُلَارَيْبَفِيهِهُدًىلِلْمُتَّقِينَذَلِكَالْكِتَابُلَارَيْبَفِيهِهُدًىلِلْمُتَّقِينَ
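For comparison, here is a minimal Python 3 sketch of the same idea: decode the bytes to text first, then match against the Arabic block. The helper name find_arabic and the sample string are only illustrative, and the input is assumed to be UTF-8 encoded.

```python
import re

# Runs of characters from the Arabic block (U+0600 to U+06FF).
ARABIC_RUN = re.compile(r'[\u0600-\u06FF]+')

def find_arabic(raw: bytes, encoding: str = "utf-8") -> list:
    """Decode raw bytes to text, then return every run of Arabic characters."""
    text = raw.decode(encoding)  # bytes -> str, so code points exist to match on
    return ARABIC_RUN.findall(text)

# Hypothetical sample data, encoded to bytes to mimic a Python 2 str.
raw = "What is the reason ذلك الكتاب behind this?".encode("utf-8")
print(find_arabic(raw))
```

The decode step is what makes the \u range meaningful; without it the pattern has only bytes to look at.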
Your code is:
print re.findall(r'[\u0600-\u06FF]+',my_string)
When matching a byte sequence, there is no such concept as Unicode code points. Therefore, the \u escape sequences in the regular expression don't make any sense. They are not interpreted as you thought, but just mean u.
So when parsing the regular expression for bytes, it is equivalent to:
print re.findall(r'[u0600-u06FF]+',my_string)
This character class is interpreted as “one of u060, or a byte in the range 0-u, or one of 06FF”. This, in turn, is equivalent to [0-u], since all the other bytes are already included in this range.
print re.findall(r'[0-u]+', my_string)
Demonstration:
my_string = "What is thizz?"
print re.findall(r'[\u0600-\u06FF]+',my_string)
['What', 'is', 'thi', '?']
Note that the zz is not matched, since z comes after u in the ASCII character set.
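The equivalence can also be checked directly. The sketch below runs on Python 3 and writes out by hand the character class that the Python 2 byte-string pattern degrades to. It does not use the \u pattern on bytes, because recent Python 3 versions reject that escape in a bytes pattern.

```python
import re

sample = "What is thizz?"

# What the pattern effectively becomes once each \u is read as a plain "u".
literal_class = re.findall(r'[u0600-u06FF]+', sample)

# The simplified class: every character from "0" (0x30) through "u" (0x75).
simplified_class = re.findall(r'[0-u]+', sample)

print(literal_class)     # ['What', 'is', 'thi', '?']
print(simplified_class)  # ['What', 'is', 'thi', '?']
```

Both classes match the same text: the space (0x20) and z (0x7A) fall outside the range, while ? (0x3F) falls inside it.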
Your original code was correct; you just needed to decode my_string with the proper encoding, 'utf-8', and add the u prefix to your re pattern, since you are working with Python 2:
>>> for x in re.findall(ur'[\u0600-\u06FF]+', my_string.decode('utf-8')):
...     print x
...
ذَلِكَ
الْكِتَابُ
لَا
رَيْبَ
فِيهِ
هُدًى
لِلْمُتَّقِينَ
ذَلِكَ
الْكِتَابُ
لَا
رَيْبَ
فِيهِ
هُدًى
لِلْمُتَّقِينَ
This gives you a list of matched unicode strings (whole words) instead of single characters, so you don't need to join them back with ''.join.
In Python 3, you don't need any of the encoding tweaking, since str is already Unicode and the default source encoding is UTF-8:
>>> for x in re.findall(r'[\u0600-\u06FF]+', my_string):
...     print(x)
...
ذَلِكَ
الْكِتَابُ
لَا
رَيْبَ
فِيهِ
هُدًى
لِلْمُتَّقِينَ
ذَلِكَ
الْكِتَابُ
لَا
رَيْبَ
فِيهِ
هُدًى
لِلْمُتَّقِينَ
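Since the question asks for sentences rather than individual words, one possible extension is to let a match continue across spaces between Arabic words. This is only a sketch for Python 3: the name extract_arabic_sentences is made up for illustration, and it assumes that a "sentence" is a run of Arabic words separated by spaces or tabs on one line.

```python
import re

# One Arabic word, followed by any number of (spaces/tabs + Arabic word) groups.
# The group is non-capturing so that findall returns the whole match.
ARABIC_SENTENCE = re.compile(r'[\u0600-\u06FF]+(?:[ \t]+[\u0600-\u06FF]+)*')

def extract_arabic_sentences(text):
    """Return each run of space-separated Arabic words as a single string."""
    return ARABIC_SENTENCE.findall(text)

# Hypothetical sample mixing English and Arabic lines.
sample = "What is the reason\nذلك الكتاب لا ريب\nbehind this?\nفيه هدى\n"
for sentence in extract_arabic_sentences(sample):
    print(sentence)
```

Punctuation outside the U+0600 to U+06FF block (for example a Latin full stop or digits) ends a match, so the pattern may need widening for real text.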