How to build html5lib parser to deal with a mixture of XML and HTML tags?

Question

I am trying to use BeautifulSoup to parse an HTML file that consists of many individual documents downloaded as a batch from LexisNexis (a legal database).

My first task is to split the HTML file into its constituent documents. I thought this would be easy, since each document is wrapped in its own tag pair: <DOC NUMBER=1>body of the 1st document</DOC>, and so on.
However, <DOC> is an XML tag, not an HTML tag (all other tags in the file are HTML). Because of this, the tag is not available in the tree when I use the regular HTML parser.
How can I build a parser in bs4 that will pick up this XML tag? I enclose the relevant section of the HTML file:

 BODY

That1Guy · Accepted Answer

You can tell bs4 to use its XML parser by passing 'xml' when you instantiate the BeautifulSoup object:

xml_soup = BeautifulSoup(xml_object, 'xml')

This should take care of your issue. You can use the xml_soup object to parse the remaining HTML; however, I'd recommend instantiating a separate soup object specifically for the HTML:

soup = BeautifulSoup(html_object, 'html.parser')
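
For the splitting step itself, here is a minimal sketch of one possible approach. It is not taken from the answer above. Note that the 'xml' parser depends on lxml being installed, so this sketch uses the built-in 'html.parser' backend instead. That backend keeps unknown tags but lowercases tag and attribute names, so <DOC NUMBER=1> becomes the tag "doc" with the attribute "number". The SAMPLE string and the split_documents helper are hypothetical stand-ins for the real LexisNexis file and are not part of bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the real LexisNexis batch file.
SAMPLE = """
<DOC NUMBER=1>
<p>Body of the 1st document</p>
</DOC>
<DOC NUMBER=2>
<p>Body of the 2nd document</p>
</DOC>
"""


def split_documents(markup):
    """Return a {number: Tag} mapping with one entry per <DOC> element."""
    # html.parser lowercases tag and attribute names, so <DOC NUMBER=1>
    # is reachable as the tag "doc" with the attribute "number".
    soup = BeautifulSoup(markup, "html.parser")
    return {doc["number"]: doc for doc in soup.find_all("doc")}


documents = split_documents(SAMPLE)
for number, doc in documents.items():
    # Each value is an ordinary Tag, so the usual bs4 navigation applies.
    print(number, doc.get_text(strip=True))
```

Each value in the mapping is a regular Tag, so the HTML inside a single document can be searched with find_all() as usual. Whether this holds up on the real file depends on how the batch is actually structured, which the question does not show.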

Tags:

python

parsing

xml

beautifulsoup

user2054545
