
Replace strings using List Comprehensions

Is it possible to write the following example as a single list comprehension?

a = ['test', 'smth']
b = ['test Lorem ipsum dolor sit amet',
     'consectetur adipiscing elit',
     'test Nulla lectus ligula',
     'imperdiet at porttitor quis',
     'smth commodo eget tortor', 
     'Orci varius natoque penatibus et magnis dis parturient montes']


for s in a:
    b = [el.replace(s,'') for el in b]

What I want is to delete specific words from a list of sentences. I can do it with a loop, but I suppose a one-line solution is possible.

I tried something like:

b = [[el.replace(s,'') for el in b] for s in a ]

but it goes wrong: it produces a list of lists, one per word in a.


I got a lot of quality answers, but now I have one more complication: what if I want to use a combination of words?

a = ['test', 'smth commodo']

Thank you for all the answers! I ran a speed test on all the solutions and here is the result. Each figure is the mean of 100 runs (except the last one, which takes too long to wait for).

                      b=10 a=2   |  b=9000 a=2 | b=9000 a=100 | b=45k a=500
---------------------------------+-------------+--------------+---------------
COLDSPEED solution:   0.0000206  |  0.0311071  |  0.0943433   |  4.5012770
Jean Fabre solution:  0.0000871  |  0.1722340  |  0.2635452   |  5.2981001
Jpp solution:         0.0000212  |  0.0474531  |  0.0464369   |  0.2450547
Ajax solution:        0.0000334  |  0.0303891  |  0.5262040   | 11.6994496
Daniel solution:      0.0000167  |  0.0162156  |  0.1301132   |  6.9071504
Kasramvd solution:    0.0000120  |  0.0084146  |  0.1704623   |  7.5648351

We can see that Jpp's solution is the fastest, BUT we can't use it: it is the only one of the solutions that doesn't work with combinations of words (I have already written to him and hope he will improve his answer!). So it looks like @cᴏʟᴅsᴘᴇᴇᴅ's solution is the fastest on big data sets.

Mikhail_Sam asked Apr 23 '18 08:04



2 Answers

There's nothing wrong with what you have, but if you want to clean things up a bit and performance isn't important, compile a regex pattern and call its sub method inside a list comprehension.

>>> import re
>>> p = re.compile(r'\b({})\b'.format('|'.join(a)))
>>> [p.sub('', text).strip() for text in b]

['Lorem ipsum dolor sit amet',
 'consectetur adipiscing elit',
 'Nulla lectus ligula',
 'imperdiet at porttitor quis',
 'commodo eget tortor',
 'Orci varius natoque penatibus et magnis dis parturient montes'
]
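If the goal is only to keep the plain str.replace logic in one expression, the loop from the question can also be folded with functools.reduce. This is a minimal sketch, not part of the original answer; the names nested and cleaned are chosen here for illustration. It also shows why the nested comprehension misbehaves: each inner list starts again from the original b, so the replacements are never chained.

```python
from functools import reduce

a = ['test', 'smth']
b = ['test Lorem ipsum dolor sit amet',
     'consectetur adipiscing elit',
     'test Nulla lectus ligula',
     'imperdiet at porttitor quis',
     'smth commodo eget tortor',
     'Orci varius natoque penatibus et magnis dis parturient montes']

# The attempt from the question: one full copy of b per word in a,
# each with only that single word removed.
nested = [[el.replace(s, '') for el in b] for s in a]

# reduce threads every word in a through the same string, which is
# what the original for-loop does to each element.
cleaned = [reduce(lambda text, s: text.replace(s, ''), a, el) for el in b]
```

Like the original loop, this removes substrings rather than whole words and leaves the leading space behind.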

Details
Your pattern will look something like this:

\b    # word-boundary - remove if you also want to replace substrings
(
test  # word 1
|     # regex OR pipe
smth  # word 2 ... you get the picture
)
\b    # end with another word boundary - again, remove for substr replacement

And this is the compiled regex pattern matcher:

>>> p
re.compile(r'\b(test|smth)\b', re.UNICODE)
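The annotated layout shown in the details is close to something Python accepts directly: with the re.VERBOSE flag, whitespace and # comments inside a pattern are ignored. A small sketch, assuming the same two words (the names verbose and compact are only illustrative):

```python
import re

# Same pattern as the compact form, written with inline comments.
verbose = re.compile(r"""
    \b        # word boundary
    (
      test    # word 1
      |       # regex OR pipe
      smth    # word 2
    )
    \b        # closing word boundary
""", re.VERBOSE)

compact = re.compile(r'\b(test|smth)\b')

sample = 'test Lorem smth testing'
verbose_result = verbose.sub('', sample)
compact_result = compact.sub('', sample)
```

Both patterns leave 'testing' alone, because the trailing \b does not match in the middle of a word. Keep in mind that in verbose mode an unescaped space is ignored, so a multi-word phrase needs its space escaped, for example with re.escape.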

Another consideration is whether the words you want to remove contain characters that the regex engine would interpret specially rather than as literals. These are regex metacharacters, and you can escape them with re.escape while building your pattern.

p = re.compile(r'\b({})\b'.format(
    '|'.join([re.escape(word) for word in a]))
)
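To illustrate why the escaping matters, and how the same pattern copes with multi-word entries, here is a hedged sketch. The word list and sample sentences are made up for the example, and sorting the alternatives longest-first is an extra precaution added here, not something the answer above does: regex alternation takes the first branch that matches, so 'smth' listed before 'smth commodo' would win.

```python
import re

# Hypothetical entries: one contains a metacharacter, one is a two-word phrase.
words = ['smth', 'v1.0', 'smth commodo']

# Longest first, so a phrase is tried before its own prefix.
ordered = sorted(words, key=len, reverse=True)
pattern = re.compile(
    r'\b(?:{})\b'.format('|'.join(re.escape(w) for w in ordered))
)

texts = ['smth commodo eget tortor', 'release v1.0 notes', 'release v120 notes']
cleaned = [pattern.sub('', t).strip() for t in texts]

# Without escaping, the dot matches any character, so 'v120' is removed too.
unescaped = re.sub(r'\bv1.0\b', '', 'release v120 notes')
```

Note that strip only trims the ends, so a word removed from the middle of a sentence leaves a double space behind.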

Of course, keep in mind that with larger data and more replacements, both regex and plain string replacement become slow. Consider something better suited to large-scale replacement, such as flashtext.

cs95 answered Oct 16 '22 17:10



If the list of words to remove is big, an ORed regular expression (like "\btest\b|\bsmth\b") can be slow: the regex engine tests the first word, then the second, and so on, so each attempt costs O(n) in the number of words.

I suggest using a replacement function with a set for word lookup: return the word itself if it is not in the set, otherwise return an empty string to remove the word:

a = {'test', 'smth'}
b = ['test Lorem ipsum dolor sit amet',
     'consectetur adipiscing elit',
     'test Nulla lectus ligula',
     'imperdiet at porttitor quis',
     'smth commodo eget tortor',
     'Orci varius natoque penatibus et magnis dis parturient montes']

import re

# match every word; drop it if it is in the set, otherwise keep it unchanged
result = [re.sub(r"\b(\w+)\b", lambda m: "" if m.group(1) in a else m.group(1), c) for c in b]

print(result)

[' Lorem ipsum dolor sit amet', 'consectetur adipiscing elit', ' Nulla lectus ligula', 'imperdiet at porttitor quis', ' commodo eget tortor', 'Orci varius natoque penatibus et magnis dis parturient montes']

Now if your list of "words" to replace contains strings made of 2 words, this method doesn't work, because \w doesn't match spaces. A second pass could be done for the entries made of 2 words:

a = {'lectus ligula', 'porttitor quis'}

and feeding the previous result through a similar filter, this time with an explicit 2-word match:

result = [re.sub(r"\b(\w+ ?\w+)\b", lambda m : "" if m.group(1) in a else m.group(1),c) for c in result]

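That makes 2 passes, but if the list of words is huge, it's still faster than an exhaustive regex. One caveat: re.sub matches never overlap, so a two-word entry is only found when it lines up with how the words pair off from the left. With the data above, 'porttitor quis' is removed but 'lectus ligula' is not, because 'Nulla lectus' is consumed as a pair first.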
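The two-word pass can miss a pair when an earlier match has already consumed its first word. One way around that while keeping the set lookup is to split each sentence into tokens and, at every position, try the longest phrase first. The following is only a sketch under assumptions: remove_phrases and phrases are hypothetical names, tokens are split on whitespace (so punctuation stays attached to words), and runs of whitespace are collapsed to single spaces.

```python
# Phrases to delete, stored as tuples of words so they can live in a set.
phrases = {('test',), ('smth',), ('lectus', 'ligula'), ('porttitor', 'quis')}
max_len = max(len(p) for p in phrases)


def remove_phrases(text):
    """Drop every phrase found in `phrases`, preferring the longest match."""
    tokens = text.split()
    kept = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate that still fits, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in phrases:
                i += n  # skip the whole phrase
                break
        else:
            # No phrase starts here: keep the token and move on.
            kept.append(tokens[i])
            i += 1
    return ' '.join(kept)


b = ['test Nulla lectus ligula', 'imperdiet at porttitor quis']
result = [remove_phrases(s) for s in b]
```

Each lookup is a set membership test, so the cost per token depends on the longest phrase length rather than on the number of phrases.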

Jean-François Fabre answered Oct 16 '22 17:10
