How can I remove strings from sentences if they match strings in a list?

I have a pandas.Series with sentences like this:

0    mi sobrino carlos bajó conmigo el lunes       
1    juan antonio es un tio guay                   
2    voy al cine con ramón                         
3    pepe el panadero siempre se porta bien conmigo
4    martha me hace feliz todos los días

On the other hand, I have a list of names and surnames like this:

l = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']

I want to match sentences from the series against the names in the list. The real data is much bigger than this example, so I assumed that an element-wise comparison between the series and the list would not be efficient. Instead, I joined all the names in the list into one big string like this:

'|'.join(l)

I then tried to build a boolean mask that I could later use to select the sentences containing any of the names in the list:

series.apply(lambda x: x in '|'.join(l))

but it returns:

0    False
1    False
2    False
3    False
4    False

which is clearly not ok.

I also tried using str.contains(), but it doesn't behave as I expect: it matches a name anywhere inside a longer word, and that is not what I need (i.e. I need an exact, whole-word match).

Could you please point me in the right direction here?

Thank you very much in advance

asked Jul 22 '20 10:07

Miguel 2488

3 Answers

If you need an exact match, you can use word boundaries:

series.str.contains('|'.join(rf"\b{x}\b" for x in l))
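A slightly more defensive variant of the same idea, as a sketch rather than a definitive implementation: it assumes the names might contain regex metacharacters, so each one goes through `re.escape`, and it wraps the whole alternation in a single pair of word boundaries. The sample data is copied from the question; the names `pattern`, `mask` and `matched` are only illustrative.

```python
import re

import pandas as pd

# Sample data copied from the question.
series = pd.Series([
    "mi sobrino carlos bajó conmigo el lunes",
    "juan antonio es un tio guay",
    "voy al cine con ramón",
    "pepe el panadero siempre se porta bien conmigo",
    "martha me hace feliz todos los días",
])
l = ["juan", "antonio", "esther", "josefa", "mariano", "cristina", "carlos"]

# One alternation inside a non-capturing group, wrapped in word boundaries.
# re.escape protects against names that contain regex metacharacters.
pattern = r"\b(?:" + "|".join(re.escape(name) for name in l) + r")\b"

# Boolean mask: True where a sentence contains any listed name as a whole word.
mask = series.str.contains(pattern, regex=True)

# Use the mask to select only the matching sentences.
matched = series[mask]
print(matched)
```

With the sample data, only the first two sentences should be selected, since they are the only ones containing `carlos`, `juan` or `antonio` as whole words.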

To remove the listed values, split each sentence into words and use a generator expression that keeps only the words that are not in the list:

series = series.apply(lambda x: ' '.join(y for y in x.split() if y not in l))
print (series)
                            
0                mi sobrino bajó conmigo el lunes
1                                  es un tio guay
2                           voy al cine con ramón
3  pepe el panadero siempre se porta bien conmigo
4             martha me hace feliz todos los días
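If the names can be followed by punctuation (for example `carlos,`), splitting on whitespace would miss them. A possible alternative, which is not the method shown above but a regex-based sketch under that assumption, is to delete whole-word matches with `str.replace` and then tidy up the leftover spaces. The names `pattern` and `cleaned` are hypothetical.

```python
import re

import pandas as pd

# Sample data copied from the question.
series = pd.Series([
    "mi sobrino carlos bajó conmigo el lunes",
    "juan antonio es un tio guay",
    "voy al cine con ramón",
    "pepe el panadero siempre se porta bien conmigo",
    "martha me hace feliz todos los días",
])
l = ["juan", "antonio", "esther", "josefa", "mariano", "cristina", "carlos"]

# Whole-word alternation; re.escape guards against regex metacharacters.
pattern = r"\b(?:" + "|".join(re.escape(name) for name in l) + r")\b"

cleaned = (
    series.str.replace(pattern, "", regex=True)  # delete the matched names
    .str.replace(r"\s+", " ", regex=True)        # collapse runs of whitespace
    .str.strip()                                 # trim leading/trailing spaces
)
print(cleaned)
```

Note that collapsing whitespace also normalizes any spacing that was already irregular in the original sentences, which may or may not be what you want.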

answered Oct 11 '22 16:10

jezrael

Use a regex that checks whether a word appears at the start, at the end, or in between:

import re

import pandas as pd

df = pd.DataFrame({'texts': [
                             'mi sobrino carlos bajó conmigo el lunes',
                             'juan antonio es un tio guay',
                             'voy al cine con ramón',
                             'pepe el panadero siempre se porta bien conmigo',
                             'martha me hace feliz todos los días '
                             ]})

names = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']

# Match each name at the start of the text, at the end of the text,
# or between word boundaries.
pattern = "|".join([f"^{s}|{s}$|\\b{s}\\b" for s in names])

# Keep the rows in which any column matches the pattern (case-insensitive).
df[df.apply(lambda x: 
            x.astype(str).str.contains(pattern, flags=re.I)).any(axis=1)]

answered Oct 11 '22 18:10

mujjiga

One option is set intersection:

l = {'juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos'}
# True where a sentence shares at least one word with the set of names
series.apply(lambda x: len(set(x.split()).intersection(l)) > 0)
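A self-contained sketch of the set-based approach, assuming the sentences are already lowercase and whitespace-separated as in the question. It uses `set.isdisjoint`, which can stop at the first shared word instead of building the full intersection; the names `names` and `mask` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
series = pd.Series([
    "mi sobrino carlos bajó conmigo el lunes",
    "juan antonio es un tio guay",
    "voy al cine con ramón",
    "pepe el panadero siempre se porta bien conmigo",
    "martha me hace feliz todos los días",
])
names = {"juan", "antonio", "esther", "josefa", "mariano", "cristina", "carlos"}

# True where at least one whitespace-separated word of the sentence is a name.
mask = series.apply(lambda sentence: not names.isdisjoint(sentence.split()))

print(series[mask])
```

Because set membership is a hash lookup, the cost per sentence depends on the number of words in it rather than on the size of the name list, which may matter when the list is large.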

answered Oct 11 '22 17:10

Ezer K


Tags: python, list, python-3.x, pandas, nlp
