Parsing HTML with a SAX parser

I am trying to parse an ordinary HTML file with a SAX parser (JDOM's SAXBuilder):

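import java.io.IOException;
import java.util.List;
import org.jdom.Document;      // JDOM 1.x; the package is org.jdom2 in JDOM 2
import org.jdom.JDOMException;
import org.jdom.input.SAXBuilder;

SAXBuilder builder2 = new SAXBuilder();
try {
    // build() already returns a JDOM Document, so no cast is needed
    Document sdoc = builder2.build(readFile);
    // JDOM has no getElementsByTagName(); ask the root for its (no-namespace) <body> children
    List<?> bodies = sdoc.getRootElement().getChildren("body");
    System.out.println("nodelist>>>>>>>>>>>" + bodies.size());
} catch (JDOMException | IOException e1) {
    // JDOMException: input is not well-formed XML; IOException: the file could not be read
    e1.printStackTrace();
}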

But I am getting this exception:

Open quote is expected for attribute "{1}" associated with an  element type  "class".

Can anyone please tell me why I am getting this exception? The HTML document is well formed, and all of its opening and closing tags are properly matched.

Thanks in advance.

user972590 asked Oct 19 '11 06:10


3 Answers

As flash says, you need an HTML parser, not an XML parser. HTML is not XML.

Good parsers I've used are Neko and TagSoup. Neko is a good all-round parser; TagSoup specifically aims to be able to parse anything, no matter how ill-formed.
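
To illustrate the difference without extra dependencies, the sketch below uses the JDK's own lenient HTML parser, javax.swing.text.html.parser.ParserDelegator, as a stand-in for Neko or TagSoup, which are third-party libraries and are not shown here. It is an old parser built around HTML 3.2, so treat this only as a demonstration that an HTML parser tolerates unquoted attributes and unclosed <p> tags, not as a recommendation over the libraries named above. The class name LenientHtmlDemo, the method classValues, and the sample markup are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class LenientHtmlDemo {

    /** Collects the value of the class attribute from every start tag that has one. */
    static List<String> classValues(String html) {
        List<String> found = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                Object value = attrs.getAttribute(HTML.Attribute.CLASS);
                if (value != null) {
                    found.add(value.toString());
                }
            }
        };
        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical input: one unquoted attribute value and two unclosed <p> tags
        String html = "<html><body><p class=intro>Hello<p class=\"note\">World</body></html>";
        System.out.println(classValues(html));
    }
}
```

The callback style is similar in spirit to SAX: the parser pushes start tags, end tags and text to a handler, but it repairs or tolerates HTML constructs that would stop an XML parser.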

Tom Anderson answered Sep 29 '22 17:09


Generally speaking, you cannot parse HTML with an XML parser:

  • HTML's element tags are not required to match in all cases. (For example a <p> tag does not require a matching </p> tag.) This will cause terminal indigestion for an XML parser.

  • Real-world HTML is notorious for not conforming to the HTML spec, let alone to an XML-compatible subset of HTML.

However, if your input document is XHTML, you should in theory be able to parse it with an XML parser, for example through the SAX API. You should even be able to validate the document against the XHTML schema.
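
A small sketch of the points above, assuming only the SAX parser bundled with the JDK: the same content is parsed once as loose HTML (unquoted attribute, unclosed <p>) and once as XHTML-style markup. The class name XmlVersusHtml, the helper parseError, and both sample strings are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class XmlVersusHtml {

    /** Returns null if the SAX parser accepts the markup, otherwise a description of the error. */
    static String parseError(String markup) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            // Well-formedness errors arrive here with a line and column number
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        } catch (Exception e) {
            // Parser configuration or I/O problems
            return e.toString();
        }
    }

    public static void main(String[] args) {
        // Legal HTML, but not well-formed XML
        String html = "<html><body><p class=intro>Hello<p>World</body></html>";
        // Same content with quoted attributes and matching close tags
        String xhtml = "<html><body><p class=\"intro\">Hello</p><p>World</p></body></html>";

        System.out.println("HTML:  " + parseError(html));
        System.out.println("XHTML: " + parseError(xhtml));
    }
}
```

An unquoted attribute value such as class=intro is the sort of input that typically produces the "Open quote is expected" error from the question, so it may be worth checking the file for one.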

Stephen C answered Sep 29 '22 18:09


Please have a look at HtmlParser. SAX is normally not a good choice for parsing HTML.

flash answered Sep 29 '22 17:09