I am having a table with 87 million rows and 5 columns. I have a separate file too, with around 3500 words. I want to check for the words in .txt file and check for that word in 4 columns of the table for each row. If that word is present in any of the columns, then I want to remove those rows. This would help me to reduce the number of rows here considerably. The following is the code I am using,
bad_words = pd.read_csv('badwords.txt')
bad_words.dtypes
words object
dtype: object
bad_words
words
0 word1
1 word3
2 word5
3 word13
4 word16
data
s.no column1 column2 column3 column4
1 aaaword1b aaaword2b aaaword3b aaaword4b
2 aaaword5b aaaword6b aaaword7b aaaword8b
3 aaaword9b aaaword10b aaaword11b aaaword12b
4 aaaword13b aaaword14b aaaword15b aaaword16b
5 aaaword17b aaaword18b aaaword19b aaaword20b
I want to remove the rows that contain words from the bad word document. The output of this should be,
data
s.no column1 column2 column3 column4
3 aaaword9b aaaword10b aaaword11b aaaword12b
5 aaaword17b aaaword18b aaaword19b aaaword20b
I am trying to do something like,
data[(data['column1'].str.contains("word1|word3|word5|word13|word16")==False)|
(data['column2'].str.contains("word1|word3|word5|word13|word16")==False)|
(data['column3'].str.contains("word1|word3|word5|word13|word16")==False)]
But I am not sure whether we can do it for the entire 3500 words. Also not sure whether this is the efficient way to do for 87 million rows.
Updated the question with string patterns rather that the direct words. Sorry for the bad requirement earlier.
Can anybody suggest me a better way to do this?
Thanks
You can use apply method to check by row and create a vector indicating whether the row contains anything in the bad_words using the isin method and then subset the original data frame based on the logic vector returned:
data[~data.apply(lambda row: row.isin(bad_words.words).any(), axis = 1)]
#s.no column1 column2 column3 column4
#2 3 word9 word10 word11 word12
#4 5 word17 word18 word19 word20
For the updated question, here is an option that might work depending on your actual data:
data[~data.apply(lambda row: bad_words.words.apply(lambda w: row.str.contains(w + "(?=\D)").any()).any(), axis = 1)]
# sno column1 column2 column3 column4
#2 3 aaaword9b aaaword10b aaaword11b aaaword12b
#4 5 aaaword17b aaaword18b aaaword19b aaaword20b
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With