When Jsoup encounters certain types of HTML (either complex or incorrect) it may emit HTML that is badly formed. An example is:
<html>
<head>
<meta name="x" content="y is "bad" here">
</head>
<body/>
</html>
where the quotes should have been escaped. When Jsoup parses this it emits:
<html>
<head>
<meta name="x" content="y is " bad"="" here"="" />
</head>
<body></body>
</html>
which is not conformant HTML or XML. This is problematic as it will fail at the next parser down the chain.
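The failure can be reproduced without any third-party parser. Below is a minimal sketch, assuming the JDK's built-in DocumentBuilder stands in for "the next parser down the chain"; the class and method names (WellFormedCheck, isWellFormedXml) are made up for illustration and are not part of Jsoup.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper: reports whether a string parses as well-formed XML.
final class WellFormedCheck {

    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow parse errors instead of letting the default handler print them.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored */ }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (Exception e) {
            // Any parser configuration, SAX, or I/O failure counts as "not well-formed".
            return false;
        }
    }
}
```

Passing the emitted markup above to isWellFormedXml should return false, because bad" and here" are not legal XML attribute names.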
Is there any way of ensuring that Jsoup either emits an error message or (like HtmlTidy) outputs well-formed XML, even if it has lost some information? (After all, we cannot now be sure what is correct.)
UPDATE: The code that fails is:
@Test
public void testJsoupParseMetaBad() {
    String s = "<html><meta name=\"x\" content=\"y is \"bad\" here\"><body></html>";
    Document doc = Jsoup.parse(s);
    String ss = doc.toString();
    Assert.assertEquals("<html> <head> <meta name=\"x\" content=\"y is \""
            + " bad\"=\"\" here\"=\"\" /> </head> <body></body> </html>", ss);
}
I am using:
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.7.2</version>
</dependency>
Others seem to have the same problem: see "JSoup - Quotations inside attributes". The answer there doesn't help me, as I have to accept what I am given.
The problem arises at parse time, because jsoup creates three attributes from:
content="y is "bad" here"
and the names of two of those attributes (bad" and here") contain a quote character. On output, jsoup escapes attribute values but not attribute names.
Since you are building the HTML document from a string, you can detect the problem during the parse phase. There is a Jsoup.parse overload that takes an org.jsoup.parser.Parser as an argument. The default parse method does not track errors, but a parser on which you have called setTrackErrors does:
String s = "<html><meta name=\"x\" content=\"y is \"bad\" here\"><body></html>";
Parser parser = Parser.htmlParser(); // or Parser.xmlParser()
parser.setTrackErrors(100); // record up to 100 parse errors
Document doc = Jsoup.parse(s, "", parser);
System.out.println(parser.getErrors());
Output:
[37: Unexpected character 'a' in input state [AfterAttributeValue_quoted], 40: Unexpected character ' ' in input state [AttributeName], 46: Unexpected character '>' in input state [AttributeName]]
If you don't want to change how you parse and just want valid output, you can remove the invalid attributes after parsing:
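import java.util.HashSet;
import java.util.Set;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Removes every attribute whose name is not allowed (see isForbidden).
public static void fixIt(Document doc) {
    Elements els = doc.getAllElements();
    for (Element el : els) {
        Attributes attributes = el.attributes();
        // Collect the keys first so nothing is removed while iterating.
        Set<String> remove = new HashSet<>();
        for (Attribute a : attributes) {
            if (isForbidden(a.getKey())) {
                remove.add(a.getKey());
            }
        }
        for (String k : remove) {
            el.removeAttr(k);
        }
    }
}

// True if the attribute name must be dropped from the output.
public static boolean isForbidden(String el) {
    return el.contains("\""); // TODO: reject other characters that are illegal in attribute names
}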
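The isForbidden check above only rejects names that contain a double quote and leaves the rest as a TODO. One possible way to fill it in is sketched below, assuming it is enough to accept an ASCII subset of the XML Name production (the full production also allows many non-ASCII characters). The class name AttributeNames and the pattern are illustrative choices, not something Jsoup provides.

```java
import java.util.regex.Pattern;

// Hypothetical stricter replacement for isForbidden.
final class AttributeNames {

    // ASCII subset of the XML Name production: the first character is a
    // letter, '_' or ':'; the rest may also be digits, '.' or '-'.
    private static final Pattern XML_NAME =
            Pattern.compile("[A-Za-z_:][A-Za-z0-9_:.\\-]*");

    // True if the name is null, empty, or contains any character outside the subset.
    static boolean isForbidden(String name) {
        return name == null || !XML_NAME.matcher(name).matches();
    }
}
```

With this version, names such as bad" and here" are rejected, while ordinary names such as content or data-id are kept. It is deliberately conservative: a legal non-ASCII attribute name would also be dropped, so widen the pattern if your input needs that.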