Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Tagsoup fails to parse html document from a StringReader ( java )

I have this function:

private Node getDOM(String str) throws SearchEngineException {

                DOMResult result = new DOMResult();

                try {
                        XMLReader reader = new Parser();
                        reader.setFeature(Parser.namespacesFeature, false);
                        reader.setFeature(Parser.namespacePrefixesFeature, false);
                        Transformer transformer = TransformerFactory.newInstance().newTransformer();
                        transformer.transform(new SAXSource(reader,new InputSource(new StringReader(str))), result);
                } catch (Exception ex) {
                        throw new SearchEngineException("NukatSearchEngine.getDom: " + ex.getMessage());
                }

                return result.getNode();
        }

It takes a String that contains the html document sent by the http server after a POST request, but fails to parse it properly - I only get like four nodes from the entire document. The string itself looks fine - if I print it out and copypasta it into a text document I see the page I expected.

When I use an overloaded version of the above method:

private Node getDOM(URL url) throws SearchEngineException {

                DOMResult result = new DOMResult();

                try {
                        XMLReader reader = new Parser();
                        reader.setFeature(Parser.namespacesFeature, false);
                        reader.setFeature(Parser.namespacePrefixesFeature, false);
                        Transformer transformer = TransformerFactory.newInstance().newTransformer();
                        transformer.transform(new SAXSource(reader, new InputSource(url.openStream())), result);
                } catch (Exception ex) {
                        throw new SearchEngineException("NukatSearchEngine.getDom: " + ex.getMessage());
                }

                return result.getNode();
        }

then everything works just fine - I get a proper DOM tree, but I need to somehow retrieve the POST answer from server.

Storing the string in a file and reading it back does not work - still getting the same results.

What could be the problem?

like image 359
zajcev Avatar asked Nov 05 '22 16:11

zajcev


1 Answers

Is it maybe a problem with the xml encoding?

like image 176
mickthompson Avatar answered Nov 12 '22 17:11

mickthompson