
Repair HTML Tag Brackets using Python

Tags: python, string

I have a lot of HTML text, like

text = 'Hello, how <sub> are </sub> you ? There is a <sub> small error </sub  in this text here and another one <sub> here /sub> .'

Sometimes HTML tags such as <sub> and </sub> are missing their < or > brackets, which causes problems later in the code. How can I detect the missing brackets intelligently and repair them?

The correct text would be:

text = 'Hello, how <sub> are </sub> you ? There is a <sub> small error </sub>  in this text here and another one <sub> here </sub> .'

Of course, I could hardcode every possible bracket configuration, but that would take too long because my text contains many more errors of this kind:

import re

text = re.sub(r'</sub ', r'</sub>', text)
text = re.sub(r' /sub>', r'</sub>', text)

Hardcoded substitutions like these might also add an extra bracket to samples that are already correct.

henry asked Apr 09 '19 23:04


2 Answers

Try this:

text = 'Hello, how <sub> are </sub> you ? There is a <sub> small error </sub  in this text here and another one <sub> here /sub> .'

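text_list = text.split()
for i, word in enumerate(text_list):
    # Match only tokens that are a sub tag once the brackets are stripped,
    # so words that merely contain "sub" (e.g. "subject") are left alone.
    if word.strip('<>') in ('sub', '/sub'):
        if not word.startswith('<'):
            word = '<' + word
        if not word.endswith('>'):
            word += '>'
        text_list[i] = word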

result = ' '.join(text_list)
print(result)

The output will be:

Hello, how <sub> are </sub> you ? There is a <sub> small error </sub> in this text here and another one <sub> here </sub> .
Heaven answered Oct 08 '22 09:10


I would search for an expression like sub.*?/sub. It assumes nothing about the brackets, but it only matches a sub that is paired with a /sub, which reduces the chance of false matches. The reluctant quantifier *? is necessary: a greedy .* would match everything from the first sub to the last /sub in a single match.
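To see what the reluctant quantifier changes, here is a small sketch that compares the greedy and reluctant forms. The shortened sample string and the variable names are made up for illustration.

```python
import re

# Shortened, hypothetical sample with two sub ... /sub pairs.
sample = 'a <sub> x </sub b <sub> y /sub> c'

# Greedy: .* runs from the first "sub" to the last "/sub".
greedy = re.findall(r'sub.*/sub', sample)

# Reluctant: .*? stops at the nearest "/sub", giving one match per pair.
reluctant = re.findall(r'sub.*?/sub', sample)

print(greedy)     # one long match spanning both pairs
print(reluctant)  # two separate matches
```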

Combine this with optional brackets and a capture group, which re.sub lets you reference in the replacement:

text = re.sub('<?sub>?(.*?)<?/sub>?', '<sub>\\1</sub>', text)
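For reference, here is a self-contained sketch of the substitution above, run against the sample from the question. The helper name repair_tag and its tag parameter are hypothetical additions meant to cover tags other than sub. Like the one-liner, it assumes that tags are not nested and that the tag name does not occur inside an ordinary word in the text.

```python
import re

def repair_tag(text, tag):
    """Rewrite every tag ... /tag pair with complete brackets.

    Hypothetical helper: assumes non-nested tags and that the tag name
    does not appear inside ordinary words such as "subject".
    """
    name = re.escape(tag)
    # Optional brackets on both tags; reluctant body so pairs stay separate.
    pattern = '<?' + name + '>?(.*?)<?/' + name + '>?'
    return re.sub(
        pattern,
        lambda m: '<' + tag + '>' + m.group(1) + '</' + tag + '>',
        text,
    )

text = ('Hello, how <sub> are </sub> you ? There is a <sub> small error </sub'
        '  in this text here and another one <sub> here /sub> .')

fixed = repair_tag(text, 'sub')
print(fixed)

# Running the repair on already-correct text leaves it unchanged.
print(repair_tag(fixed, 'sub') == fixed)
```

Because each match is rewritten as a whole pair, tags that are already correct are replaced by themselves, so no extra brackets are added.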
Mad Physicist answered Oct 08 '22 10:10