Replace strings using List Comprehensions

Question

Is it possible to do this example using List Comprehensions:

a = ['test', 'smth']
b = ['test Lorem ipsum dolor sit amet',
     'consectetur adipiscing elit',
     'test Nulla lectus ligula',
     'imperdiet at porttitor quis',
     'smth commodo eget tortor', 
     'Orci varius natoque penatibus et magnis dis parturient montes']


for s in a:
    b = [el.replace(s,'') for el in b]

What I want is to delete specific words from a list of sentences. I can do it with a loop, but I suppose a one-line solution is possible.

I tried something like:

b = [[el.replace(s,'') for el in b] for s in a ]

but it goes wrong
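For illustration: the nested comprehension builds one full copy of b per word in a, each with only that single word removed, rather than applying every removal to the same strings. Below is a minimal sketch of the difference, with functools.reduce as one possible one-line form of the original loop. The shortened sample data and the names nested and one_liner are assumptions made for the example.

```python
from functools import reduce

a = ['test', 'smth']
b = ['test Lorem ipsum', 'smth commodo eget', 'test and smth']

# Nested comprehension: one inner list per word, each with only that word removed.
nested = [[el.replace(s, '') for el in b] for s in a]
print(len(nested))  # 2, one inner list per entry of a

# Folding over a applies every replacement to the same string in turn,
# which is what the explicit for-loop does.
one_liner = [reduce(lambda text, s: text.replace(s, ''), a, el) for el in b]
print(one_liner)  # [' Lorem ipsum', ' commodo eget', ' and ']
```

Whether this reads better than the two-line loop is a matter of taste; it does the same work.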

I got a lot of quality answers, but now I have one more complication: what if I want to remove combinations of words?

a = ['test', 'smth commodo']

Thank you for all the answers! I ran a speed test on all the solutions and here is the result. Each figure is the mean of 100 runs (except the last column, which took too long to repeat).

                      b=10 a=2   |  b=9000 a=2 | b=9000 a=100 | b=45k a=500
---------------------------------+-------------+--------------+---------------
COLDSPEED solution:   0.0000206  |  0.0311071  |  0.0943433   |  4.5012770
Jean Fabre solution:  0.0000871  |  0.1722340  |  0.2635452   |  5.2981001
Jpp solution:         0.0000212  |  0.0474531  |  0.0464369   |  0.2450547
Ajax solution:        0.0000334  |  0.0303891  |  0.5262040   | 11.6994496
Daniel solution:      0.0000167  |  0.0162156  |  0.1301132   |  6.9071504
Kasramvd solution:    0.0000120  |  0.0084146  |  0.1704623   |  7.5648351

We can see that Jpp's solution is the fastest, BUT we can't use it: it is the only one of the solutions that doesn't work on combinations of words (I have already written to him and hope he will improve his answer!). So it looks like @cᴏʟᴅsᴘᴇᴇᴅ's solution is the fastest on the big data sets.

cs95 · Accepted Answer

There's nothing wrong with what you have, but if you want to clean things up a bit and performance isn't important, then compile a regex pattern and call sub inside a loop.

>>> import re
>>> p = re.compile(r'\b({})\b'.format('|'.join(a)))
>>> [p.sub('', text).strip() for text in b]

['Lorem ipsum dolor sit amet',
 'consectetur adipiscing elit',
 'Nulla lectus ligula',
 'imperdiet at porttitor quis',
 'commodo eget tortor',
 'Orci varius natoque penatibus et magnis dis parturient montes'
]

Details
Your pattern will look something like this:

\b    # word-boundary - remove if you also want to replace substrings
(
test  # word 1
|     # regex OR pipe
smth  # word 2 ... you get the picture
)
\b    # end with another word boundary - again, remove for substr replacement

And this is the compiled regex pattern matcher:

>>> p
re.compile(r'\b(test|smth)\b', re.UNICODE)
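The annotated layout shown earlier is close to what Python accepts directly under the re.VERBOSE flag, where whitespace in the pattern is ignored and # starts a comment. Here is a small sketch, assuming the same two words; the name p_verbose is only illustrative.

```python
import re

# Same alternation as above, written with re.VERBOSE so the comments
# can live inside the pattern itself.
p_verbose = re.compile(r"""
    \b        # word boundary
    (
      test    # word 1
      |       # regex OR
      smth    # word 2
    )
    \b        # closing word boundary
""", re.VERBOSE)

print(p_verbose.sub('', 'test Nulla lectus ligula').strip())  # Nulla lectus ligula
print(p_verbose.sub('', 'testing'))  # testing (the boundary keeps substrings intact)
```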

Another consideration is whether the words you are removing contain characters that the regex engine would interpret specially rather than treat as literals. These are regex metacharacters, and you can escape them while building your pattern using re.escape.

p = re.compile(r'\b({})\b'.format(
    '|'.join([re.escape(word) for word in a]))
)
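To make the effect of escaping concrete, here is a small sketch with a made-up entry, '3.5', whose dot would otherwise match any character. The sample text and variable names are assumptions for the example.

```python
import re

words = ['3.5']
text = '3.5 and 345'

# Without escaping, '.' is a wildcard, so '345' is removed as well.
unescaped = re.compile(r'\b({})\b'.format('|'.join(words)))
print(repr(unescaped.sub('', text)))  # ' and '

# With re.escape, only the literal '3.5' is removed.
escaped = re.compile(r'\b({})\b'.format('|'.join(re.escape(w) for w in words)))
print(repr(escaped.sub('', text)))  # ' and 345'
```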

Of course, keep in mind that with larger data and more replacements, regex and string replacements both become slow. Consider using something better suited to large-scale replacement, like flashtext.
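As for the follow-up about combinations of words: the compiled alternation accepts multi-word entries unchanged. Here is a minimal sketch, assuming the a = ['test', 'smth commodo'] list from the question and a shortened b. Sorting the entries longest-first is a precaution added here, not part of the original answer: Python tries alternatives in order, so a longer phrase should come before any shorter entry it starts with.

```python
import re

a = ['test', 'smth commodo']
b = ['test Lorem ipsum dolor sit amet',
     'smth commodo eget tortor',
     'smth alone stays']

# Longest entries first, so a phrase wins over a shorter entry that is its prefix.
alternatives = sorted(a, key=len, reverse=True)
p = re.compile(r'\b({})\b'.format('|'.join(re.escape(w) for w in alternatives)))

cleaned = [p.sub('', text).strip() for text in b]
print(cleaned)  # ['Lorem ipsum dolor sit amet', 'eget tortor', 'smth alone stays']
```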

Jean-François Fabre · Answer

If the list of words to remove is big, an ORed regular expression (like "\btest\b|\bsmth\b") can be slow: the regex engine tests the first word, then the second, and so on, which is O(n) in the number of words.

I suggest you use a replacement function using a set for word lookup. Return the word itself if not found, else return nothing to remove the word:

a = {'test', 'smth'}
b = ['test Lorem ipsum dolor sit amet',
     'consectetur adipiscing elit',
     'test Nulla lectus ligula',
     'imperdiet at porttitor quis',
     'smth commodo eget tortor',
     'Orci varius natoque penatibus et magnis dis parturient montes']

import re

result = [re.sub(r"\b(\w+)\b", lambda m : "" if m.group(1) in a else m.group(1),c) for c in b]

print(result)

[' Lorem ipsum dolor sit amet', 'consectetur adipiscing elit', ' Nulla lectus ligula', 'imperdiet at porttitor quis', ' commodo eget tortor', 'Orci varius natoque penatibus et magnis dis parturient montes']
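The same set-lookup idea can also be sketched without a regex at all, by splitting on whitespace. This is a different technique from the re.sub callback above and rests on an assumption: words are separated by plain whitespace, so an entry with punctuation attached (for example 'test,') would not be matched. As a side effect it avoids the leading spaces seen in the output above.

```python
a = {'test', 'smth'}
b = ['test Lorem ipsum dolor sit amet',
     'consectetur adipiscing elit',
     'smth commodo eget tortor']

# Keep only the words that are not in the removal set; the set makes each lookup O(1).
result = [' '.join(word for word in sentence.split() if word not in a)
          for sentence in b]

print(result)
# ['Lorem ipsum dolor sit amet', 'consectetur adipiscing elit', 'commodo eget tortor']
```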

Now if your list of "words" to replace contains strings composed of 2 words, this method doesn't work, because \w doesn't match spaces. A second pass could be done with the set of two-word "words":

a = {'lectus ligula', 'porttitor quis'}

and feed the result of the first pass into a similar filter, this time with an explicit two-word match:

result = [re.sub(r"\b(\w+ ?\w+)\b", lambda m : "" if m.group(1) in a else m.group(1),c) for c in result]

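That makes 2 passes, but if the list of words is huge, it's still faster than an exhaustive regex. One caveat: re.sub consumes non-overlapping matches from left to right, so a two-word entry is removed only when it lines up with the pairs the pattern happens to pick. With the data above, 'porttitor quis' is removed but 'lectus ligula' is not, because 'Nulla lectus' is matched first.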
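One way to handle single words and two-word phrases in a single pass, while keeping the set lookup, is to split each sentence on whitespace and try the longest candidate first at every position. The sketch below is a different technique from the regex callback above, not the answer's own method; remove_phrases and max_words are hypothetical names, and it assumes words are separated by plain whitespace.

```python
def remove_phrases(sentence, phrases, max_words=2):
    """Drop any run of up to max_words consecutive words found in phrases."""
    words = sentence.split()
    kept = []
    i = 0
    while i < len(words):
        # Try the longest candidate first so 'lectus ligula' is checked before 'lectus'.
        for n in range(max_words, 0, -1):
            if ' '.join(words[i:i + n]) in phrases:
                i += n  # skip the matched run
                break
        else:
            # No candidate starting at this word is in the set: keep the word.
            kept.append(words[i])
            i += 1
    return ' '.join(kept)


a = {'test', 'lectus ligula', 'porttitor quis'}
b = ['test Nulla lectus ligula',
     'imperdiet at porttitor quis',
     'consectetur adipiscing elit']

result = [remove_phrases(sentence, a) for sentence in b]
print(result)  # ['Nulla', 'imperdiet at', 'consectetur adipiscing elit']
```

Because every position is tried as a possible phrase start, the outcome does not depend on how the words happen to pair up.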

Tags: python, string, list, list-comprehension

Mikhail_Sam
