I have a test string, for example:
content = 'I opened my mouth, "Good morning!" I said cheerfully'
I want to use a regex to remove the text between the double speech marks, but not the speech marks themselves, so that it returns:
'I opened my mouth, "" I said cheerfully'
I am using the following code:
content = re.sub(r'".*"'," ",content)
But this removes the double speech marks as well. What pattern should I use to keep the speech marks but remove the text inside them?
Use '""' as the replacement string:
>>> content = 'I opened my mouth, "Good morning!" I said cheerfully'
>>> content = re.sub(r'".*"', '""', content)
>>> print(content)
I opened my mouth, "" I said cheerfully
BTW, .* matches as much as possible (greedy). To match in a non-greedy fashion, use .*? or [^"]*.
>>> content = 'I opened my mouth, "Good morning!" I said cheerfully. "How is everyone?"'
>>> content = re.sub(r'".*?"', '""', content)
>>> print(content)
I opened my mouth, "" I said cheerfully. ""
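To make the greedy versus non-greedy difference concrete, here is a small sketch that runs all three patterns on the same two-quotation string. The variable names are only for illustration.

```python
import re

text = 'I opened my mouth, "Good morning!" I said cheerfully. "How is everyone?"'

# Greedy: .* runs from the first quote to the LAST quote in the string,
# so the text between the two quotations is swallowed as well.
greedy = re.sub(r'".*"', '""', text)

# Non-greedy: .*? stops at the nearest closing quote.
lazy = re.sub(r'".*?"', '""', text)

# Negated class: [^"]* cannot cross a quote, so it also stops at the nearest one.
negated = re.sub(r'"[^"]*"', '""', text)

print(greedy)   # I opened my mouth, ""
print(lazy)     # I opened my mouth, "" I said cheerfully. ""
print(negated)  # I opened my mouth, "" I said cheerfully. ""
```

One difference worth knowing: without the re.DOTALL flag, . does not match a newline, whereas [^"] does, so the two forms behave differently on quotations that span several lines.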
You could also use lookarounds:
(?<=")([^"]+)(?=")

content = re.sub(r'(?<=")([^"]+)(?=")', '', content)
Two notes:
.* will capture everything up to the last double quote in your string, instead of the next one. This is why I've used [^"]+.
Importantly, this will not work when the overall string contains two double-quoted substrings, unless you increment the index at which the next search begins. For example, with
I opened my mouth, "Good morning!" I said cheerfully. "How is everyone?"
the text I said cheerfully. also sits between two double quotes, so to avoid capturing it you must increment the index by one after Good morning!" is found.
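The overlap problem can be reproduced directly. The sketch below first shows the lookaround pattern also clearing the text between the two quotations, then one possible workaround: letting the pattern consume the quotes themselves, so that the next search starts after the closing quote. The workaround is a substitute technique rather than the lookaround approach, and the variable names are only for illustration.

```python
import re

text = 'I opened my mouth, "Good morning!" I said cheerfully. "How is everyone?"'

# Lookarounds do not consume the quotes, so the closing quote of one
# quotation also acts as the "opening" quote for the text that follows it.
lookaround = re.sub(r'(?<=")([^"]+)(?=")', '', text)
print(lookaround)  # I opened my mouth, """"

# Consuming both quotes pairs them up from left to right, so the text
# between the two quotations is left alone.
consumed = re.sub(r'"[^"]*"', '""', text)
print(consumed)    # I opened my mouth, "" I said cheerfully. ""
```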