
How can I prevent closing of tags in bad HTML using BeautifulSoup (python)?

I automatically translate the content of HTML pages into different languages, so I have to extract all text nodes from HTML pages that are sometimes badly written (I cannot edit these HTML files).

With BeautifulSoup I can easily extract those texts and replace them with translations, but when I output the HTML after parsing it with html = BeautifulSoup(source_html), the result is sometimes broken because BeautifulSoup automatically closes tags (for instance, a table tag is closed in the wrong place).

Is there a way to prevent BeautifulSoup from closing these tags?

For instance this is my input:

html = "<table><tr><td>some text</td></table>" (the closing </tr> is missing)

After soup = BeautifulSoup(html) I get "<table><tr><td>some text</td></tr></table>",

and I want to get exactly the same HTML as the input.

Is it possible at all?

pawel asked Oct 10 '22 23:10


1 Answer

BeautifulSoup excels in parsing and extracting data from badly formatted HTML/XML, but if the broken HTML is ambiguous then it uses a set of rules to interpret the tags (which may not be what you want). See the section on Parsing HTML in the docs which ends with an example that sounds very similar to your situation.

If you know what's wrong with your tags and understand the rules that BeautifulSoup uses, you may be able to adjust your HTML slightly before parsing (perhaps by removing or moving certain tags) so that BeautifulSoup returns the output you want.
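
One way to find out what is wrong with the markup before adjusting it is to scan it for tags that are never closed. The following is a minimal sketch using Python 3's standard html.parser module rather than BeautifulSoup; the class name UnclosedTagFinder and the VOID_TAGS set are illustrative names, not part of any library, and the check is deliberately simplistic.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (illustrative, not exhaustive).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class UnclosedTagFinder(HTMLParser):
    """Hypothetical helper: records tags that are opened but never closed."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag name, (line, column)) for open tags
        self.problems = []   # list of (kind, tag name, (line, column))

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, self.getpos()))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag not in [name for name, _ in self.open_tags]:
            self.problems.append(("stray close", tag, self.getpos()))
            return
        # Pop back to the matching opener; anything skipped was never closed.
        while self.open_tags:
            name, pos = self.open_tags.pop()
            if name == tag:
                break
            self.problems.append(("unclosed", name, pos))


finder = UnclosedTagFinder()
finder.feed("<table><tr><td>some text</td></table>")
finder.close()
for kind, name, (line, column) in finder.problems:
    print(kind, name, "at line", line, "column", column)
```

For the input from the question this reports the <tr> as unclosed, which is the tag BeautifulSoup ends up closing on its own.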

If you can post a short example, someone might be able to give you more specific help.


Update (some examples)

For example, consider the example given in the docs (linked above):

from BeautifulSoup import BeautifulSoup
html = """
<html>
<form>
 <table>
 <td><input name="input1">Row 1 cell 1
 <tr><td>Row 2 cell 1
 </form> 
 <td>Row 2 cell 2<br>This</br> sure is a long cell
</body> 
</html>"""
print BeautifulSoup(html).prettify()

The <table> tag will be closed before </form> to ensure that the table is properly nested within the form, leaving the last <td> hanging.

If we understand the problem, we can get the closing tag (</table>) in the right place by removing "<form>" before parsing:

>>> html = html.replace("<form>", "")
>>> soup = BeautifulSoup(html)
>>> print soup.prettify()
<html>
 <table>
  <td>
   <input name="input1" />
   Row 1 cell 1
  </td>
  <tr>
   <td>
    Row 2 cell 1
   </td>
   <td>
    Row 2 cell 2
    <br />
    This
    sure is a long cell
   </td>
  </tr>
 </table>
</html>

If the <form> tag IS important, you can still add it back after parsing. For example:

>>> from BeautifulSoup import Tag  # Tag is needed to build new elements
>>> new_form = Tag(soup, "form")  # create form element
>>> soup.html.insert(0, new_form)  # insert form as child of html
>>> new_form.insert(0, soup.table.extract()) # move table into form
>>> print soup.prettify()
<html>
 <form>
  <table>
   <td>
    <input name="input1" />
    Row 1 cell 1
   </td>
   <tr>
    <td>
     Row 2 cell 1
    </td>
    <td>
     Row 2 cell 2
     <br />
     This
     sure is a long cell
    </td>
   </tr>
  </table>
 </form>
</html>
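
If the goal is to leave the original markup completely untouched, another option is to avoid re-serializing a parse tree at all and replace only the text between tags. This is a rough sketch of that idea in Python 3 using a regular expression, not BeautifulSoup; translate_text_nodes and its translate callback are hypothetical names, and the pattern assumes there are no ">" characters inside attribute values, comments, or script/style blocks.

```python
import re

# Matches one tag-like chunk; everything between matches is treated as text.
TAG_RE = re.compile(r"(<[^>]*>)")


def translate_text_nodes(html, translate):
    """Return html with every non-blank text run passed through translate()."""
    parts = TAG_RE.split(html)
    # Because the pattern has a capturing group, re.split keeps the tags:
    # text runs sit at even indexes and tags at odd indexes.
    for i in range(0, len(parts), 2):
        if parts[i].strip():
            parts[i] = translate(parts[i])
    return "".join(parts)


source = "<table><tr><td>some text</td></table>"
# str.upper stands in for a real translation function here.
result = translate_text_nodes(source, str.upper)
print(result)  # <table><tr><td>SOME TEXT</td></table>
```

Because the tags are copied through verbatim, the missing </tr> stays missing, so the output differs from the input only in the text itself. The trade-off is that this is not a real parser and will mishandle markup that breaks the assumptions above.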
Shawn Chin answered Oct 18 '22 08:10