
HTML parsing using DOM-Java

Tags:

java

html

dom

I want to parse an HTML file using Java, and I have used the DocumentBuilder class for it. My HTML contains an <img src="xyz"> tag without a closing </img> tag, which browsers allow. But when I pass it to DocumentBuilder for parsing, I get this error:

The element type "img" must be terminated by the matching end-tag </img>.

Java :

DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
Document document = docBuilder.parse(is);

What should I do to get rid of this error?

Vallabh Lakade asked Jun 30 '26 21:06


1 Answer

DocumentBuilder is part of Java's XML parsing framework. An XML parser will not correctly parse HTML: the languages look similar, but XML has stricter requirements. (You've already seen one of the differences: in XML, every element must have a matching end tag or be written as a self-closing tag such as <img src="xyz"/>, while in HTML some elements, like img, never take an end tag.)
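To make the difference concrete, here is a minimal sketch using only the JDK's XML classes. The class name XmlStrictnessDemo and the helper parsesAsXml are made up for this illustration, and the <p> wrapper is just sample markup. It feeds the same img element to DocumentBuilder twice, once unclosed and once self-closed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class XmlStrictnessDemo {

    // Hypothetical helper: true if the markup is well-formed XML, false otherwise.
    static boolean parsesAsXml(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Well-formedness violations (such as a missing end tag) land here.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // HTML-style img with no end tag: rejected by the XML parser.
        System.out.println(parsesAsXml("<p><img src=\"xyz\"></p>"));   // false
        // Self-closed img: well-formed XML, so it is accepted.
        System.out.println(parsesAsXml("<p><img src=\"xyz\"/></p>"));  // true
    }
}
```

Rewriting every tag to be self-closing only works if you control the HTML and it is otherwise well-formed XML, which real-world pages usually are not.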

Try an HTML parser instead. I've heard good things about jsoup (http://jsoup.org/).
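As a rough sketch of what that could look like, assuming the jsoup library is on your classpath (it is not part of the JDK): the class name JsoupImgDemo and the helper firstImgSrc are invented for this example, and the markup is the img tag from the question wrapped in a sample <p>.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupImgDemo {

    // Hypothetical helper: returns the src of the first img element, or null if there is none.
    static String firstImgSrc(String html) {
        // Jsoup.parse accepts HTML as browsers do, so an unclosed <img> is fine.
        Document doc = Jsoup.parse(html);
        Element img = doc.selectFirst("img");
        return img == null ? null : img.attr("src");
    }

    public static void main(String[] args) {
        System.out.println(firstImgSrc("<p><img src=\"xyz\"></p>"));
    }
}
```

If you are reading from an InputStream as in the question, jsoup also has a Jsoup.parse(InputStream, String charsetName, String baseUri) overload.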

Wander Nauta answered Jul 02 '26 09:07


