I have a lot of HTML text, like
text = 'Hello, how <sub> are </sub> you ? There is a <sub> small error </sub in this text here and another one <sub> here /sub> .'
Sometimes HTML tags, such as <sub> and </sub>, are missing their < or > brackets. This can lead to problems later in the code. My question is: how can I detect those missing brackets intelligently and repair them?
The correct text would be:
text = 'Hello, how <sub> are </sub> you ? There is a <sub> small error </sub> in this text here and another one <sub> here </sub> .'
Of course, I could hardcode all possible bracket configurations, but that would take too long as there are more errors like that in my text.
text = re.sub( r'</sub ', r'</sub>', text)
text = re.sub( r' /sub>', r'</sub>', text)
...and the previous code might add another bracket to correct samples.
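One way to keep hardcoded patterns like these from touching tags that are already correct is to guard them with lookarounds. The following is only a sketch under the assumption that the only damage is a missing > after </sub or a missing < before /sub>; it does not cover other kinds of breakage, and the variable name expected is introduced here for illustration:

```python
import re

text = ('Hello, how <sub> are </sub> you ? There is a <sub> small error </sub in this text '
        'here and another one <sub> here /sub> .')

# Append '>' only where '</sub' is not already followed by one.
text = re.sub(r'</sub(?!>)', '</sub>', text)
# Prepend '<' only where '/sub>' is not already preceded by one.
text = re.sub(r'(?<!<)/sub>', '</sub>', text)

# Illustrative check against the corrected text given above.
expected = ('Hello, how <sub> are </sub> you ? There is a <sub> small error </sub> in this text '
            'here and another one <sub> here </sub> .')
print(text == expected)
```

Because the lookarounds do not consume characters, the surrounding spaces are kept and well-formed tags are left unchanged.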
Try this:
text = 'Hello, how <sub> are </sub> you ? There is a <sub> small error </sub in this text here and another one <sub> here /sub> .'
text_list = text.split()
for i, word in enumerate(text_list):
    # Any whitespace-separated token containing 'sub' is treated as a tag.
    if 'sub' in word:
        # Add the opening bracket if it is missing.
        if word[0] != '<':
            word = '<' + word
        # Add the closing bracket if it is missing.
        if word[-1] != '>':
            word += '>'
        text_list[i] = word
result = ' '.join(text_list)
print(result)
The output will be:
Hello, how <sub> are </sub> you ? There is a <sub> small error </sub> in this text here and another one <sub> here </sub> .
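Note that a plain 'sub' in word test also fires on ordinary words such as "subject". A slightly stricter variant, sketched below, only rewrites tokens that consist entirely of a sub tag with optional brackets. The names TAG_TOKEN and repair_tokens are made up for this example, and it still assumes that tags are separated from the surrounding text by whitespace:

```python
import re

# A whole token that is a sub tag, with either bracket possibly missing.
# Group 1 captures the optional '/' of a closing tag.
TAG_TOKEN = re.compile(r'<?(/?)sub>?')

def repair_tokens(text):
    words = text.split()
    for i, word in enumerate(words):
        m = TAG_TOKEN.fullmatch(word)
        if m:
            # Rebuild the tag with both brackets in place.
            words[i] = '<' + m.group(1) + 'sub>'
    return ' '.join(words)

print(repair_tokens('a subject <sub> x </sub and <sub> y /sub> .'))
```

Like the loop above, this joins the tokens with single spaces, so any original runs of whitespace are collapsed.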
I would search for an expression like sub.*?/sub. It does not assume anything about the brackets at all, but it will only match a sub that is paired with a /sub, decreasing the probability of false matches. The reluctant quantifier *? is necessary to prevent it from matching everything from the first sub to the last /sub. Couple this with the fact that re.sub supports capture groups in the replacement:
text = re.sub('<?sub>?(.*?)<?/sub>?', '<sub>\\1</sub>', text)
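As a self-contained sketch, the same substitution can be wrapped in a small function and run on the sample text from the question. The function name repair_sub_tags is made up for this example, and the approach assumes that "sub" appears in the text only as a tag name and that no tag pair spans a line break (by default, . does not match a newline):

```python
import re

def repair_sub_tags(text):
    # Optional brackets around each 'sub' ... '/sub' pair; the lazy group
    # keeps the content between them so it can be re-wrapped in clean tags.
    return re.sub(r'<?sub>?(.*?)<?/sub>?', r'<sub>\1</sub>', text)

text = ('Hello, how <sub> are </sub> you ? There is a <sub> small error </sub in this text '
        'here and another one <sub> here /sub> .')
fixed = repair_sub_tags(text)
print(fixed)
```

Pairs that are already well formed are rewritten to themselves, so running the function a second time leaves the text unchanged.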