How can Jsoup output well-formed XML?

Tags: html, xml, jsoup

When Jsoup encounters certain types of HTML (either complex or incorrect) it may emit HTML that is badly formed. An example is:

<html>
 <head>
  <meta name="x" content="y is "bad" here">
 </head>
 <body/>
</html>

where the quotes should have been escaped. When Jsoup parses this it emits:

<html>
 <head>
  <meta name="x" content="y is " bad"="" here"="" />
 </head>
 <body></body>
</html>

which is not conformant HTML or XML. This is problematic as it will fail at the next parser down the chain.
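
To see how a downstream XML parser reacts, the emitted markup can be fed to the JDK's built-in parser. The following is only a minimal sketch: the class name `WellFormedCheck` and the method `isWellFormed` are hypothetical helpers made up for illustration, not part of Jsoup, and the check covers well-formedness only (no DTD or schema validation).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a string parses as well-formed XML.
final class WellFormedCheck {
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Raised for any well-formedness violation, e.g. a quote inside an attribute name.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("XML parser unavailable or input unreadable", e);
        }
    }
}
```

Passing the Jsoup output shown above to `WellFormedCheck.isWellFormed` should return `false`, because `bad"` is not a legal attribute name; the same markup with the inner quotes written as `&quot;` should return `true`.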

Is there any way of ensuring that Jsoup either emits an error message or (like HtmlTidy) outputs well-formed XML, even if it has lost some information? After all, we cannot now be sure what is correct.

UPDATE: The code that fails is:

    @Test
    public void testJsoupParseMetaBad() {
        String s = "<html><meta name=\"x\" content=\"y is \"bad\" here\"><body></html>";
        Document doc = Jsoup.parse(s);
        String ss = doc.toString();
        Assert.assertEquals("<html> <head> <meta name=\"x\" content=\"y is \""
                + " bad\"=\"\" here\"=\"\" /> </head> <body></body> </html>", ss);
    }

I am using:

    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.7.2</version>
    </dependency>

Others seem to have the same problem: "JSoup - Quotations inside attributes". The answer there doesn't help me, as I have to accept what I am given.

asked Jan 07 '14 15:01

peter.murray.rust

1 Answer

The problem arises at parse time, because jsoup creates three attributes from:

content="y is "bad" here"

and the names of two of those attributes (bad" and here") contain a quote character. Jsoup escapes attribute values on output, but not attribute names.

Since you are building the HTML document from a string, you can detect the errors during parsing. There is a Jsoup.parse overload that takes an org.jsoup.parser.Parser as an argument; the default parse method does not track errors.

    String s = "<html><meta name=\"x\" content=\"y is \"bad\" here\"><body></html>";
    Parser parser = Parser.htmlParser(); // or Parser.xmlParser
    parser.setTrackErrors(100);
    Document doc = Jsoup.parse(s, "", parser);
    System.out.println(parser.getErrors());

Output:

[37: Unexpected character 'a' in input state [AfterAttributeValue_quoted], 40: Unexpected character ' ' in input state [AttributeName], 46: Unexpected character '>' in input state [AttributeName]]

If you don't want to change how you parse and just want valid output, you could simply remove the invalid attributes:

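import java.util.HashSet;
import java.util.Set;

import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Removes every attribute whose name is flagged by isForbidden.
public static void fixIt(Document doc) {
    Elements els = doc.getAllElements();
    for (Element el : els) {
        Attributes attributes = el.attributes();
        // Collect the keys first so the attributes are not modified while iterating.
        Set<String> remove = new HashSet<>();
        for (Attribute a : attributes) {
            if (isForbidden(a.getKey())) {
                remove.add(a.getKey());
            }
        }

        for (String k : remove) {
            el.removeAttr(k);
        }
    }
}

// Flags attribute names containing a double quote; other invalid characters are not handled yet.
public static boolean isForbidden(String attributeName) {
    return attributeName.contains("\""); // TODO
}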
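
One possible way to fill in the TODO is to reject any name that is not a legal XML name, instead of looking only for quotes. This is a sketch under assumptions rather than a definitive rule: the class name `AttributeNames` is hypothetical, and the pattern is a deliberately simplified, ASCII-only approximation of the XML 1.0 Name production, so legal non-ASCII names would also be rejected.

```java
import java.util.regex.Pattern;

// Hypothetical helper: a stricter variant of isForbidden.
final class AttributeNames {
    // ASCII-only approximation of an XML name (assumption): a letter, '_' or ':'
    // followed by letters, digits, '_', ':', '.' or '-'.
    private static final Pattern XML_NAME =
            Pattern.compile("[A-Za-z_:][A-Za-z0-9_:.\\-]*");

    // Returns true when the attribute name could not be written out as well-formed XML.
    static boolean isForbidden(String attributeName) {
        return !XML_NAME.matcher(attributeName).matches();
    }
}
```

With this variant, the names bad" and here" from the example are flagged, while content and name are kept.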

answered Oct 04 '22 01:10

Liviu Stirb
