 

How to build html5lib parser to deal with a mixture of XML and HTML tags?

I am trying to use BeautifulSoup to parse an HTML file that consists of many individual documents downloaded as a batch from LexisNexis (a legal database).

  • My first task is to split the HTML file into its constituent documents. I thought this would be easy, since each document is wrapped as <DOC NUMBER=1>body of the 1st document</DOC>, and so on.

  • However, this <DOC> tag is an XML tag, not an HTML tag (all other tags in the file are HTML). Because of this, the regular HTML parser does not make the tag available in the tree.

  • How can I build a parser in bs4 that will pick up this XML tag? I enclose the relevant section of the HTML file:

    <!-- Hide XML section from browser <DOC NUMBER=1> <DOCFULL> --> BODY <!-- Hide XML section from browser </DOCFULL> </DOC> -->

user2054545 asked Nov 12 '22 09:11



1 Answer

You can tell bs4 to use its XML parser by passing 'xml' when you instantiate the BeautifulSoup object (this parser requires lxml to be installed):

from bs4 import BeautifulSoup
xml_soup = BeautifulSoup(xml_object, 'xml')  # 'xml' selects the lxml-based XML parser

This should take care of your issue. You can use the xml_soup object to parse the remaining HTML; however, I'd recommend instantiating a separate soup object specifically for the HTML:

soup = BeautifulSoup(html_object, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
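
One thing to watch for: in the excerpt from the question, the <DOC> tags sit inside HTML comments, so a parser will most likely hand them back as comment text rather than as elements. A rough sketch of an alternative is to split the raw file on those comment markers first and only then parse each piece. This uses a plain regular expression rather than a bs4 parser, and it assumes every document follows exactly the marker layout shown in the question; the names DOC_PATTERN and split_documents are made up for illustration.

```python
import re

# Assumed marker layout, copied from the excerpt in the question:
#   <!-- ... <DOC NUMBER=1> <DOCFULL> --> BODY <!-- ... </DOCFULL> </DOC> -->
# (?:(?!-->).)*? matches lazily without running past the end of a comment.
DOC_PATTERN = re.compile(
    r"<!--(?:(?!-->).)*?<DOC\s+NUMBER=(\d+)>(?:(?!-->).)*?-->"  # opening comment
    r"(.*?)"                                                    # document body (HTML)
    r"<!--(?:(?!-->).)*?</DOC>\s*-->",                          # closing comment
    re.DOTALL | re.IGNORECASE,
)


def split_documents(raw_html):
    """Return a list of (number, body_html) tuples, one per DOC block."""
    return [
        (int(match.group(1)), match.group(2).strip())
        for match in DOC_PATTERN.finditer(raw_html)
    ]


# Small made-up sample that follows the same marker layout.
sample = (
    "<!-- Hide XML section from browser <DOC NUMBER=1> <DOCFULL> -->"
    "<p>first</p>"
    "<!-- Hide XML section from browser </DOCFULL> </DOC> -->\n"
    "<!-- Hide XML section from browser <DOC NUMBER=2> <DOCFULL> -->"
    "<p>second</p>"
    "<!-- Hide XML section from browser </DOCFULL> </DOC> -->"
)

docs = split_documents(sample)
for number, body in docs:
    print(number, body)
```

Each body string can then be passed to BeautifulSoup on its own, for example BeautifulSoup(body, 'html.parser'). If the real file deviates from the excerpt (quoted attribute values, different comment text), the pattern would need adjusting.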
That1Guy answered Nov 14 '22 23:11
