Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can i remove strings from sentences if string matches with strings in list

I have a pandas.Series with sentences like this:

0    mi sobrino carlos bajó conmigo el lunes       
1    juan antonio es un tio guay                   
2    voy al cine con ramón                         
3    pepe el panadero siempre se porta bien conmigo
4    martha me hace feliz todos los días 

on the other hand, I have a list of names and surnames like this:

l = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']

I want to match sentences from the series to the names in the list. The real data is much much bigger than this examples, so I thought that element-wise comparison between the series and the list was not going to be efficient, so I created a big string containing all the strings in the name list like this:

'|'.join(l)

I tried to create a boolean mask that later allows me to index the sentences that contains the names in the name list by true or false value like this:

series.apply(lambda x: x in '|'.join(l))

but it returns:

0    False
1    False
2    False
3    False
4    False

which is clearly not ok.

I also tried using str.contains() but it doesn't behave as I expect, because this method will look if any substring in the series is present in the name list, and this is not what I need (i.e. I need an exact match).

Could you please point me in the right direction here?

Thank you very much in advance

like image 956
Miguel 2488 Avatar asked Jul 22 '20 10:07

Miguel 2488


People also ask

How do I remove a string from a sentence in Python?

In Python you can use the replace() and translate() methods to specify which characters you want to remove from a string and return a new modified string result. It is important to remember that the original string will not be altered because strings are immutable.

How do I remove a string from a list of strings?

Method #1 : Using remove() This particular method is quite naive and not recommended to use, but is indeed a method to perform this task. remove() generally removes the first occurrence of K string and we keep iterating this process until no K string is found in list.

How do you remove certain elements from a string?

Using 'str. replace() , we can replace a specific character. If we want to remove that specific character, replace that character with an empty string. The str. replace() method will replace all occurrences of the specific character mentioned.

How do you remove words from a string?

Given a String and a Word, the task is remove that Word from the String. Approach : In Java, this can be done using String replaceAll method by replacing given word with a blank space.


3 Answers

If need exact match you can use word boundaries:

series.str.contains('|'.join(rf"\b{x}\b" for x in l))

For remove values by list is use generator comprehension with filtering only non matched values by splitted text:

series = series.apply(lambda x: ' '.join(y for y in x.split() if y not in l))
print (series)
                            
0                mi sobrino bajó conmigo el lunes
1                                  es un tio guay
2                           voy al cine con ramón
3  pepe el panadero siempre se porta bien conmigo
4             martha me hace feliz todos los días
like image 89
jezrael Avatar answered Oct 11 '22 16:10

jezrael


Regex to check if a word at the start or at the end or in between

df = pd.DataFrame({'texts': [
                             'mi sobrino carlos bajó conmigo el lunes',
                             'juan antonio es un tio guay',
                             'voy al cine con ramón',
                             'pepe el panadero siempre se porta bien conmigo',
                             'martha me hace feliz todos los días '
                             ]})

names = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']

pattern = "|".join([f"^{s}|{s}$|\\b{s}\\b" for s in names])

df[df.apply(lambda x: 
            x.astype(str).str.contains(pattern, flags=re.I)).any(axis=1)]
like image 1
mujjiga Avatar answered Oct 11 '22 18:10

mujjiga


one option is set intersection:

l = set(['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos'])
s.apply(lambda x: len(set(x.split()).intersection(l))>0)
like image 1
Ezer K Avatar answered Oct 11 '22 17:10

Ezer K