I am trying to use BeautifulSoup to parse an HTML file that consists of many individual documents downloaded as a batch from LexisNexis (a legal database).
My first task is to split the HTML file into its constituent documents. I thought this would be easy, since each document is wrapped like <DOC NUMBER=1>body of the 1st document</DOC>, and so on.
However, this <DOC> tag is an XML tag, not an HTML tag (all other tags in the file are HTML). Because of this, the tag does not show up in the tree built by the regular HTML parser.
How can I build a parser in bs4 that will pick up this XML tag? I enclose the relevant section of the HTML file:
<!-- Hide XML section from browser <DOC NUMBER=1> <DOCFULL> --> BODY <!-- Hide XML section from browser </DOCFULL> </DOC> -->
You can tell bs4 to use an XML parser by passing 'xml' when you instantiate the BeautifulSoup object (this relies on lxml being installed):
from bs4 import BeautifulSoup
xml_soup = BeautifulSoup(xml_object, 'xml')
This should take care of your issue. You can use the xml_soup object to parse the remaining HTML as well; however, I'd recommend instantiating a separate soup object specifically for the HTML:
soup = BeautifulSoup(html_object, 'html.parser')
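One caveat: in the snippet quoted in the question, the <DOC> markers sit inside HTML comments, so a parser may hand them back as comment text rather than as tags. If that turns out to be the case, a possible workaround is to split the raw text on those comment markers before parsing anything. The following is only a minimal sketch under that assumption, not something tested against real LexisNexis exports. The names DOC_PATTERN and split_documents are hypothetical, and it assumes every document is wrapped exactly as in the quoted snippet.

```python
import re

# Hypothetical pattern: one opening comment containing <DOC NUMBER=n>,
# then the document body, then one closing comment containing </DOC>.
# (?:(?!-->).)*? matches lazily without running past the end of a comment.
DOC_PATTERN = re.compile(
    r"<!--(?:(?!-->).)*?<DOC NUMBER=(\d+)>(?:(?!-->).)*?-->"  # opening marker
    r"(.*?)"                                                  # document body
    r"<!--(?:(?!-->).)*?</DOC>(?:(?!-->).)*?-->",             # closing marker
    re.DOTALL,
)


def split_documents(raw_html):
    """Map each DOC NUMBER to the HTML found between its comment markers."""
    return {
        int(match.group(1)): match.group(2).strip()
        for match in DOC_PATTERN.finditer(raw_html)
    }


# Made-up sample that mimics the structure quoted in the question.
sample = (
    "<!-- Hide XML section from browser <DOC NUMBER=1> <DOCFULL> -->"
    " <p>first</p> "
    "<!-- Hide XML section from browser </DOCFULL> </DOC> -->\n"
    "<!-- Hide XML section from browser <DOC NUMBER=2> <DOCFULL> -->"
    " <p>second</p> "
    "<!-- Hide XML section from browser </DOCFULL> </DOC> -->"
)

documents = split_documents(sample)
```

Each value in documents is then ordinary HTML, so it can be passed on its own to BeautifulSoup(chunk, 'html.parser').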