
How can Jsoup output well-formed XML?

Tags: html, xml, jsoup

When Jsoup encounters certain types of HTML (either complex or incorrect) it may emit HTML that is badly formed. An example is:

<html>
 <head>
  <meta name="x" content="y is "bad" here">
 </head>
 <body/>
</html>

where the quotes should have been escaped. When Jsoup parses this it emits:

<html>
 <head>
  <meta name="x" content="y is " bad"="" here"="" />
 </head>
 <body></body>
</html>

which is not conformant HTML or XML. This is problematic as it will fail at the next parser down the chain.
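
To see the downstream failure concretely, one could feed the emitted markup to the JDK's built-in XML parser. The following is only a minimal sketch and not part of the original test: the class name `WellFormedCheck` and the method `isWellFormed` are hypothetical, and the two sample strings are the output above (with whitespace collapsed) and a hand-cleaned variant.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

final class WellFormedCheck {

    /** Returns true if the JDK XML parser accepts the markup as well-formed XML. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow parse errors instead of letting the default handler print to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input: not well-formed.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Markup as emitted by Jsoup in the example above (whitespace collapsed).
        String emitted = "<html><head><meta name=\"x\" content=\"y is \" bad\"=\"\" here\"=\"\" />"
                + "</head><body></body></html>";
        // Hand-cleaned variant with the stray quotes removed.
        String cleaned = "<html><head><meta name=\"x\" content=\"y is bad here\" />"
                + "</head><body></body></html>";
        System.out.println(isWellFormed(emitted)); // false
        System.out.println(isWellFormed(cleaned)); // true
    }
}
```

A check like this could sit between Jsoup and the next parser so that malformed output is caught early rather than deep in the chain.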

Is there any way of ensuring that Jsoup either emits an error message or (like HtmlTidy) outputs well-formed XML, even if some information is lost? (After all, we can no longer be sure what was intended.)

UPDATE: The code that fails is:

    @Test
    public void testJsoupParseMetaBad() {
        String s = "<html><meta name=\"x\" content=\"y is \"bad\" here\"><body></html>";
        Document doc = Jsoup.parse(s);
        String ss = doc.toString();
        Assert.assertEquals("<html> <head> <meta name=\"x\" content=\"y is \""
                + " bad\"=\"\" here\"=\"\" /> </head> <body></body> </html>", ss);
    }

I am using:

    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.7.2</version>
    </dependency>

Others seem to have the same problem: see "JSoup - Quotations inside attributes". The answer there doesn't help me, as I have to accept what I am given.

asked Jan 07 '14 15:01 by peter.murray.rust



1 Answer

The problem arises at parse time: jsoup creates three attributes from:

content="y is "bad" here" 

and two of the resulting attribute names (bad" and here") contain a quote character. Jsoup escapes attribute values on output, but not attribute names.

Since you are building the HTML document from a string, you can detect the error during the parse phase. Jsoup.parse has an overload that takes an org.jsoup.parser.Parser as an argument. The default parse method does not track errors, but a Parser can be configured to do so.

    String s = "<html><meta name=\"x\" content=\"y is \"bad\" here\"><body></html>";
    Parser parser = Parser.htmlParser(); // or Parser.xmlParser()
    parser.setTrackErrors(100);          // record up to 100 parse errors
    Document doc = Jsoup.parse(s, "", parser);
    System.out.println(parser.getErrors());

Output:

[37: Unexpected character 'a' in input state [AfterAttributeValue_quoted], 40: Unexpected character ' ' in input state [AttributeName], 46: Unexpected character '>' in input state [AttributeName]]

If you don't want to change how you parse and just want valid output, you can remove the invalid attributes:

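    import java.util.HashSet;
    import java.util.Set;
    import org.jsoup.nodes.Attribute;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // Removes every attribute whose name is flagged by isForbidden.
    public static void fixIt(Document doc) {
        Elements els = doc.getAllElements();
        for (Element el : els) {
            // Collect the keys first, then remove them, so the attributes
            // are not modified while they are being iterated over.
            Set<String> remove = new HashSet<>();
            for (Attribute a : el.attributes()) {
                if (isForbidden(a.getKey())) {
                    remove.add(a.getKey());
                }
            }
            for (String k : remove) {
                el.removeAttr(k);
            }
        }
    }

    // Minimal check: flags attribute names that contain a double quote.
    public static boolean isForbidden(String el) {
        return el.contains("\""); // TODO: cover other characters that are illegal in names
    }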
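
The TODO in isForbidden could be filled in by checking the whole name against what XML allows, rather than looking only for quotes. The sketch below is an assumption, not part of jsoup: the class name `AttributeNames` is hypothetical, and the pattern is a deliberately conservative, ASCII-only subset of the XML 1.0 Name production, so it also rejects some non-ASCII names that XML would accept.

```java
import java.util.regex.Pattern;

final class AttributeNames {

    // Conservative ASCII-only subset of the XML 1.0 Name production:
    // a letter, '_' or ':' first, then letters, digits, '_', ':', '.' or '-'.
    private static final Pattern SAFE_NAME =
            Pattern.compile("[A-Za-z_:][A-Za-z0-9_:.-]*");

    /** Returns true if the attribute name should be dropped before serializing. */
    static boolean isForbidden(String name) {
        return name == null || !SAFE_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"content", "data-id", "xml:lang", "bad\"", "here\"", ""};
        for (String name : samples) {
            System.out.println("[" + name + "] forbidden=" + isForbidden(name));
        }
    }
}
```

Using an allow-list rather than a deny-list means that other characters the tokenizer may let into a name (such as '<' or '=') are rejected as well, at the cost of dropping some names that are technically legal.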

answered Oct 04 '22 01:10 by Liviu Stirb