I have a lot of HTML text, for example:
text = 'Hello, how <sub> are </sub> you ? There is a <sub> small error </sub  in this text here and another one <sub> here /sub> .'
Sometimes HTML tags, such as <sub> and </sub>, are missing their < or > brackets. This causes problems later in the code. My question is: how can I detect those missing brackets intelligently and repair them?
The correct text would be:
text = 'Hello, how <sub> are </sub> you ? There is a <sub> small error </sub>  in this text here and another one <sub> here </sub> .'
Of course, I could hardcode every possible bracket configuration, but that would take too long because my text contains many more errors of this kind:
import re
text = re.sub(r'</sub ', r'</sub>', text)
text = re.sub(r' /sub>', r'</sub>', text)
Hardcoded replacements like these might also add an extra bracket to tags that are already correct.
Try this:
text = 'Hello, how <sub> are </sub> you ? There is a <sub> small error </sub  in this text here and another one <sub> here /sub> .'
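# Split on whitespace; this assumes each tag is a separate token.
words = text.split()
for i, word in enumerate(words):
    # Only touch tokens that are a sub tag once brackets and slash are removed,
    # so ordinary words such as "subject" are left alone.
    if word.strip('<>/') == 'sub':
        if not word.startswith('<'):
            word = '<' + word
        if not word.endswith('>'):
            word += '>'
        words[i] = word
# Note: joining with a single space collapses any repeated spaces.
result = ' '.join(words)
print(result)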
The output will be:
Hello, how <sub> are </sub> you ? There is a <sub> small error </sub> in this text here and another one <sub> here </sub> .
I would search for an expression like sub.*?/sub. It assumes nothing about the brackets, but it only matches a sub that is paired with a following /sub, which reduces the chance of false matches. The reluctant quantifier *? is necessary to prevent it from matching everything from the first sub to the last /sub.
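To make the difference between the greedy and the reluctant quantifier concrete, here is a small sketch. The sample string and the variable names are made up for illustration; only re.findall from the standard library is used.

```python
import re

# Hypothetical sample: two sub pairs, the second one with a broken closing tag.
sample = 'a <sub> one </sub> b <sub> two </sub c'

# Greedy: .* runs from the first "sub" to the LAST "/sub", giving one big match.
greedy = re.findall(r'sub(.*)/sub', sample)

# Reluctant: .*? stops at the NEAREST "/sub", giving one match per pair.
reluctant = re.findall(r'sub(.*?)/sub', sample)

print(greedy)     # ['> one </sub> b <sub> two <']
print(reluctant)  # ['> one <', '> two <']
```

The greedy version swallows both pairs in a single match, which is why the reluctant form is the one to build the repair on.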
Couple this with the fact that re.sub allows references to capture groups in the replacement, and make each bracket optional in the pattern:
import re
text = re.sub(r'<?sub>?(.*?)<?/sub>?', r'<sub>\1</sub>', text)
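As a runnable sketch of this idea, the snippet below wraps the same pattern in a helper. The function name repair_tag and its tag parameter are my own additions for illustration, not something from the question; the pattern itself is the one shown above, with the tag name substituted in.

```python
import re

def repair_tag(text, tag='sub'):
    """Rewrite every opening/closing pair of `tag` with complete brackets.

    Hypothetical helper: assumes each opening tag has a closing tag and
    that only the < or > characters may be missing.
    """
    name = re.escape(tag)
    # Optional brackets around the opening and closing tag names;
    # the reluctant group captures the text in between.
    pattern = re.compile(rf'<?{name}>?(.*?)<?/{name}>?')
    return pattern.sub(lambda m: f'<{tag}>{m.group(1)}</{tag}>', text)

broken = ('Hello, how <sub> are </sub> you ? There is a <sub> small error </sub  '
          'in this text here and another one <sub> here /sub> .')
fixed = repair_tag(broken)
print(fixed)

# Running the repair on already correct text leaves it unchanged.
print(repair_tag(fixed) == fixed)
```

One limitation to keep in mind: because the brackets are optional, an ordinary word that contains sub (for example "subject") followed later by a /sub would also be treated as an opening tag, so a stricter pattern may be needed for free-form text.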