Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding words from a text document and removing the corresponding rows in dataframe - python

I am having a table with 87 million rows and 5 columns. I have a separate file too, with around 3500 words. I want to check for the words in .txt file and check for that word in 4 columns of the table for each row. If that word is present in any of the columns, then I want to remove those rows. This would help me to reduce the number of rows here considerably. The following is the code I am using,

bad_words = pd.read_csv('badwords.txt')

bad_words.dtypes

words    object
dtype: object

bad_words

        words
    0   word1
    1   word3
    2   word5
    3   word13
    4   word16
data

s.no      column1        column2        column3         column4
1         aaaword1b      aaaword2b      aaaword3b       aaaword4b
2         aaaword5b      aaaword6b      aaaword7b       aaaword8b                         
3         aaaword9b      aaaword10b     aaaword11b      aaaword12b
4         aaaword13b     aaaword14b     aaaword15b      aaaword16b
5         aaaword17b     aaaword18b     aaaword19b      aaaword20b

I want to remove the rows that contain words from the bad word document. The output of this should be,

data

s.no      column1        column2        column3         column4                       
3         aaaword9b       aaaword10b    aaaword11b      aaaword12b
5         aaaword17b      aaaword18b    aaaword19b      aaaword20b

I am trying to do something like,

data[(data['column1'].str.contains("word1|word3|word5|word13|word16")==False)| 
(data['column2'].str.contains("word1|word3|word5|word13|word16")==False)|
(data['column3'].str.contains("word1|word3|word5|word13|word16")==False)]

But I am not sure whether we can do it for the entire 3500 words. Also not sure whether this is the efficient way to do for 87 million rows.

Updated the question with string patterns rather that the direct words. Sorry for the bad requirement earlier.

Can anybody suggest me a better way to do this?

Thanks

like image 418
haimen Avatar asked Jun 27 '26 03:06

haimen


1 Answers

You can use apply method to check by row and create a vector indicating whether the row contains anything in the bad_words using the isin method and then subset the original data frame based on the logic vector returned:

data[~data.apply(lambda row: row.isin(bad_words.words).any(), axis = 1)]

#s.no   column1 column2 column3 column4
#2  3     word9  word10  word11  word12
#4  5    word17  word18  word19  word20

For the updated question, here is an option that might work depending on your actual data:

data[~data.apply(lambda row: bad_words.words.apply(lambda w: row.str.contains(w + "(?=\D)").any()).any(), axis = 1)]


# sno      column1     column2     column3     column4
#2  3    aaaword9b  aaaword10b  aaaword11b  aaaword12b
#4  5   aaaword17b  aaaword18b  aaaword19b  aaaword20b
like image 50
Psidom Avatar answered Jun 29 '26 17:06

Psidom



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!