Is it possible to do this example using List Comprehensions:
a = ['test', 'smth']
b = ['test Lorem ipsum dolor sit amet',
     'consectetur adipiscing elit',
     'test Nulla lectus ligula',
     'imperdiet at porttitor quis',
     'smth commodo eget tortor',
     'Orci varius natoque penatibus et magnis dis parturient montes']
for s in a:
b = [el.replace(s,'') for el in b]
What I want is to delete specific words from a list of sentences. I can do it with a loop, but I suppose there is a one-line solution.
I tried something like:
b = [[el.replace(s,'') for el in b] for s in a ]
but it goes wrong: it builds a list of lists (one copy of b per word in a) instead of a single flat list.
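For reference, a minimal one-liner sketch using functools.reduce, which folds every replacement over each sentence (only an illustration, not one of the answers below):
from functools import reduce
# apply each replacement from a, in order, to every sentence in b
b = [reduce(lambda s, w: s.replace(w, ''), a, el) for el in b]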
I got a lot of quality answers, but now I have one more complication: what if I want to remove combinations of words?
a = ['test', 'smth commodo']
Thank you for all the answers! I ran a speed test on all the solutions and here are the results: each value is the mean of 100 runs (except the last one, which takes too long to wait for).
Solution             | b=10 a=2  | b=9000 a=2 | b=9000 a=100 | b=45k a=500
---------------------+-----------+------------+--------------+-------------
COLDSPEED solution   | 0.0000206 | 0.0311071  | 0.0943433    |  4.5012770
Jean Fabre solution  | 0.0000871 | 0.1722340  | 0.2635452    |  5.2981001
Jpp solution         | 0.0000212 | 0.0474531  | 0.0464369    |  0.2450547
Ajax solution        | 0.0000334 | 0.0303891  | 0.5262040    | 11.6994496
Daniel solution      | 0.0000167 | 0.0162156  | 0.1301132    |  6.9071504
Kasramvd solution    | 0.0000120 | 0.0084146  | 0.1704623    |  7.5648351
We can see that Jpp's solution is the fastest, BUT we can't use it: it is the only solution that can't work on combinations of words (I have already written to him and hope he will improve his answer!). So it looks like @cᴏʟᴅsᴘᴇᴇᴅ's solution is the fastest on the big data sets.
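For reference, a timing like this can be reproduced with timeit; the sketch below times the plain loop from the question and is only meant to show the method, not the exact harness used:
import timeit

def remove_words(words, sentences):
    # the loop from the question, used here only as the timing subject
    for s in words:
        sentences = [el.replace(s, '') for el in sentences]
    return sentences

# mean of 100 runs, as in the table above
print(timeit.timeit(lambda: remove_words(a, b), number=100) / 100)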
There's nothing wrong with what you have, but if you want to clean things up a bit and performance isn't important, you can compile a regex pattern and call its sub method inside a loop.
>>> import re
>>> p = re.compile(r'\b({})\b'.format('|'.join(a)))
>>> [p.sub('', text).strip() for text in b]
['Lorem ipsum dolor sit amet',
'consectetur adipiscing elit',
'Nulla lectus ligula',
'imperdiet at porttitor quis',
'commodo eget tortor',
'Orci varius natoque penatibus et magnis dis parturient montes'
]
Details
Your pattern will look something like this:
\b # word-boundary - remove if you also want to replace substrings
(
test # word 1
| # regex OR pipe
smth # word 2 ... you get the picture
)
\b # end with another word boundary - again, remove for substr replacement
And this is the compiled regex pattern matcher:
>>> p
re.compile(r'\b(test|smth)\b', re.UNICODE)
Another consideration is whether your replacement strings themselves contain characters that the regex engine would interpret specially rather than treating them as literals. These are regex metacharacters, and you can escape them while building your pattern using re.escape.
p = re.compile(r'\b({})\b'.format(
'|'.join([re.escape(word) for word in a]))
)
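For example, a word that contains metacharacters (the 'c++' below is made up purely for illustration) gets escaped so the pattern treats it literally:
>>> import re
>>> re.escape('c++')
'c\\+\\+'
>>> '|'.join(re.escape(word) for word in ['test', 'c++'])
'test|c\\+\\+'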
Of course, keep in mind that with larger data and more replacements, both regex and plain string replacement become tedious. Consider something better suited to large-scale replacement, like flashtext.
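For instance, a minimal sketch with flashtext's KeywordProcessor (assuming the package is installed; each keyword is mapped to a single space and the whitespace is normalized afterwards, so this only illustrates the API rather than being a drop-in replacement for the regex above):
from flashtext import KeywordProcessor

kp = KeywordProcessor()
for word in a:
    # map each keyword to a single space; the extra spaces are squeezed out below
    kp.add_keyword(word, ' ')

cleaned = [' '.join(kp.replace_keywords(text).split()) for text in b]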
If the list of words to remove is big, building one ORed regular expression (like "\btest\b|\bsmth\b") can get quite slow, because matching is O(n): the regex engine tries the first word, then the second, and so on.
I suggest you use a replacement function with a set for word lookup: return the word itself if it is not in the set, or an empty string to remove it:
a = {'test', 'smth'}
b = ['test Lorem ipsum dolor sit amet',
     'consectetur adipiscing elit',
     'test Nulla lectus ligula',
     'imperdiet at porttitor quis',
     'smth commodo eget tortor',
     'Orci varius natoque penatibus et magnis dis parturient montes']
import re
result = [re.sub(r"\b(\w+)\b", lambda m: "" if m.group(1) in a else m.group(1), c) for c in b]
print(result)
[' Lorem ipsum dolor sit amet', 'consectetur adipiscing elit', ' Nulla lectus ligula', 'imperdiet at porttitor quis', ' commodo eget tortor', 'Orci varius natoque penatibus et magnis dis parturient montes']
Now if your list of "words" to replace contains strings composed of two words, this method doesn't work, because \w doesn't match spaces. A second pass can be done on the list of "words" made of two words:
a = {'lectus ligula', 'porttitor quis'}
and feeding the result into a similar filter, but with an explicit two-word match:
result = [re.sub(r"\b(\w+ ?\w+)\b", lambda m: "" if m.group(1) in a else m.group(1), c) for c in result]
So it takes two passes, but if the list of words is huge, this is still faster than one exhaustive regex.
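Put together, the two passes look like this (the set names single and pairs and the small helper are my own, introduced only to show the flow):
import re

single = {'test', 'smth'}                    # one-word removals
pairs = {'lectus ligula', 'porttitor quis'}  # two-word removals

def dropper(words):
    # replacement callback: blank out a match if it is in the given set
    return lambda m: "" if m.group(1) in words else m.group(1)

# first pass: single words
result = [re.sub(r"\b(\w+)\b", dropper(single), c) for c in b]
# second pass: two-word combinations
result = [re.sub(r"\b(\w+ ?\w+)\b", dropper(pairs), c) for c in result]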