
How to look for sentences having single characters using Python and Pandas

I'm trying to print the sentences in a dataframe that contain a one-character word, no matter whether it appears at the beginning, middle, or end of the sentence. My code works perfectly for English text, but when I switch the script to Arabic it prints the wrong output: all sentences, instead of only those that contain a one-letter word.


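import pandas as pd

# fileinput is assumed to hold the path to the input file (one sentence per line)
tdata = pd.read_csv(fileinput, nrows=0).columns[0]
skip = int(tdata.count(' ') == 0)  # skip the first row if it has no spaces (looks like a header)
tdata = pd.read_csv(fileinput, names=['sentences'], skiprows=skip)
df = tdata[tdata['sentences'].str.contains(r'\b\w{1}\b')]
print(df)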


a sample set
-----------------------------

#هلو كيفك م هي اخباركم
#ج ليش اتاخرت اليوم
#هو كلش نذل
#ترا اني كلش حباب ذ
#باب المدرسة ب مفتوح
#الحمدالله الكل بخير
#كم نَحنُ كبار  مع هذا نعتمد على والدنا في مصروفنا
#كم نَحنُ كبار  مع هذا لا تعرف كيفَ نتحدث بطريقة لائقة في المجلس
#كم نَحنُ كبار  مع هذا لا نعرف كيفَ نعبر الشارِع بمفردنا
#هل اهتممتم بوالدتكم  انها مسكينة قد لاحظت انها متعبة جداً ً
#كفاكُن كلام اذهبن  ادرسن َ  اجتهدن
#لِمَ لا تجففن مستحضر (طين البحر الميت ) قبل أن تستخدمنه ُ  
#  ادرسن َ  
# أعشقك ِ
#هل   ً
-----------------------------

expected output 

-----------------------------
#هلو كيفك م هي اخباركم
#ج ليش اتاخرت اليوم
#ترا اني كلش حباب ذ
#باب المدرسة ب مفتوح
-----------------------------

Why doesn't this code work for Arabic script the same way it does for English?

I also tried another method, which did not work either:

tdata.loc[tdata.sentences.str.contains(r"([^\w]|^)\w([^\w]|$)")]

Can't we create an array of the letters we want, like ( ذ ب ا م ك خ ح ), and print a sentence whenever one of them appears alone in it? Wouldn't that be a much better way, with less headache?

asked by programming freak, Oct 15 '22 06:10



1 Answer

Try this code:

import pandas as pd

# mocking the dataframe based on your sample data
df = pd.DataFrame({'sentences': [
    'هلو كيفك م هي اخباركم',
    'ج ليش اتاخرت اليوم',
    'هو كلش نذل',
    'ترا اني كلش حباب ذ',
    'باب المدرسة ب مفتوح',
    'الحمدالله الكل بخير',
    'كم نَحنُ كبار  مع هذا نعتمد على والدنا في مصروفنا',
    'كم نَحنُ كبار  مع هذا لا تعرف كيفَ نتحدث بطريقة لائقة في المجلس',
    'كم نَحنُ كبار  مع هذا لا نعرف كيفَ نعبر الشارِع بمفردنا',
    'لِمَ لا تجففن مستحضر (طين البحر الميت ) قبل أن تستخدمنه'
]})

# selecting sentences with a single Arabic character word using regex
pattern = '(?<![\u0600-\u06ff])[\u0600-\u06ff](?![\u0600-\u06ff])'
df = df[df['sentences'].str.contains(pattern)]

print(df)
#                sentences
# 0  هلو كيفك م هي اخباركم
# 1     ج ليش اتاخرت اليوم
# 3     ترا اني كلش حباب ذ
# 4    باب المدرسة ب مفتوح

Regex Explanation:

pattern = '(?<![\u0600-\u06ff])[\u0600-\u06ff](?![\u0600-\u06ff])'

[\u0600-\u06ff] covers the standard Arabic Unicode block (U+0600 to U+06FF), which includes letters, digits, and diacritics. The negative lookbehind (?<![\u0600-\u06ff]) and the negative lookahead (?![\u0600-\u06ff]) limit the match to a single Arabic character that is neither preceded nor followed by another Arabic character.
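
One plausible reason the original \b\w\b over-matches on diacritized Arabic, offered here as an assumption rather than something confirmed in the question: in Python 3, \w does not match combining marks such as fatha (U+064E), so a word boundary appears between a letter and its diacritic, and that letter then looks like a one-character word. The following is a minimal sketch of the difference; the sample word is built with explicit escapes so the diacritics are unambiguous, and the variable names are illustrative only.

```python
import re

# The word "nahnu" spelled with explicit escapes:
# noon, fatha, hah, noon, damma.
word = "\u0646\u064e\u062d\u0646\u064f"

word_boundary = re.compile(r"\b\w\b")
arabic_block = re.compile("(?<![\u0600-\u06ff])[\u0600-\u06ff](?![\u0600-\u06ff])")

# \w does not match the fatha, so \b falls between noon and fatha
# and the first noon is reported as a one-character word.
word_boundary_hit = word_boundary.search(word) is not None

# The block-based pattern counts the diacritics as Arabic characters,
# so no character in the word is isolated.
block_hit = arabic_block.search(word) is not None

print(word_boundary_hit, block_hit)  # True False
```
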

If you want to match a specific set of characters rather than all of them, change the middle [\u0600-\u06ff] to any set you want. For example, to match only the Arabic-Indic digits zero to nine, use [\u0660-\u0669] instead:

pattern = '(?<![\u0600-\u06ff])[\u0660-\u0669](?![\u0600-\u06ff])'
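
The same substitution can follow the question's idea of a fixed list of letters. The sketch below is only an illustration under stated assumptions: `letters`, `single_letter_pattern`, and the four-row sample are hypothetical names and data, not part of the original answer. It also shows that a stand-alone diacritic (written as the escape \u064e) is not reported once the middle class holds letters only.

```python
import re
import pandas as pd

# Hypothetical list of letters to look for, taken from the question's example.
letters = ["ذ", "ب", "ا", "م", "ك", "خ", "ح"]

# Any Arabic-block character on either side disqualifies the match.
arabic = "[\u0600-\u06ff]"
middle = "[" + "".join(re.escape(ch) for ch in letters) + "]"
single_letter_pattern = f"(?<!{arabic}){middle}(?!{arabic})"

df = pd.DataFrame({"sentences": [
    "باب المدرسة ب مفتوح",   # row 0: stand-alone letter that is in the list
    "ج ليش اتاخرت اليوم",    # row 1: stand-alone letter that is NOT in the list
    "ادرسن \u064e",           # row 2: stand-alone fatha, not a letter
    "هو كلش نذل",            # row 3: no one-character word
]})

mask = df["sentences"].str.contains(single_letter_pattern)
print(df[mask])  # only row 0 is selected
```

Keeping the lookarounds on the full block while narrowing only the middle class means a listed letter still has to stand alone among Arabic characters, diacritics included.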

For English, the whole pattern would be as follows (written as a raw string so that \b means a word boundary rather than a backspace character):

pattern = r'\b\w\b'
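
When this pattern is written as a Python literal, the r prefix matters: in a normal string, \b is the backspace character, so the regex engine never sees a word boundary. A small sketch of the difference, using a made-up English sentence:

```python
import re

sentence = "this is a test"

# In a normal string literal, "\b" is the backspace character (U+0008).
plain = "\b" + r"\w" + "\b"
# In a raw string, "\b" reaches the regex engine as a word boundary.
raw = r"\b\w\b"

plain_match = re.search(plain, sentence)
raw_match = re.search(raw, sentence)

print(plain_match)        # None: the sentence contains no backspace characters
print(raw_match.group())  # a
```
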
answered by mhalshehri, Oct 20 '22 02:10