We're using Jsoup.clean(String, Whitelist) to process some input, and it appears that Jsoup is adding an extraneous line break just prior to acceptable tags. I've seen a few people post this issue around the internet, but haven't been able to track down a solution.
For instance, let's say we have a very simple string with some bold tags within it, like so:
String htmlToClean = "This is a line with <b>bold text</b> within it.";
String returnString = Jsoup.clean(htmlToClean, Whitelist.relaxed());
System.out.println(returnString);
What comes out of the call to the clean() method is something like so:
This is a line with \n<b>bold text</b> within it.
Notice the extraneous "\n" inserted just before the opening bold tag. I can't seem to track down where in the source this is being added (although admittedly I'm new to Jsoup).
Has anyone encountered this problem and, better yet, found a way to keep this extra, unwanted character from being added to the string?
Deprecated. As of release v1.14.1, this class (Whitelist) is deprecated in favour of Safelist.
What It Is. jsoup can parse HTML files, input streams, URLs, or even strings. It eases data extraction from HTML by offering Document Object Model (DOM) traversal methods and CSS and jQuery-like selectors. jsoup can manipulate the content: the HTML element itself, its attributes, or its text.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Hmm... have not seen any options for this.
If you parse the HTML into a Document, you have some output settings available:
Document doc = Jsoup.parseBodyFragment(htmlToClean);
doc.outputSettings().prettyPrint(false);
System.out.println(doc.body().html());
With prettyPrint off you'll get the following output: This is a line with <b>bold text</b> within it.
Maybe you can write your own clean() method, since the implemented one uses Documents internally (there you can disable prettyPrint):
Original methods:
public static String clean(String bodyHtml, Whitelist whitelist) {
return clean(bodyHtml, "", whitelist);
}
public static String clean(String bodyHtml, String baseUri, Whitelist whitelist) {
Document dirty = parseBodyFragment(bodyHtml, baseUri);
Cleaner cleaner = new Cleaner(whitelist);
Document clean = cleaner.clean(dirty);
return clean.body().html();
}
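Based on the original methods above, a custom version could look roughly like the following. This is only a sketch, not jsoup's own code: the class name CleanWithoutPrettyPrint and the helper cleanNoPrettyPrint are made up for illustration, and it assumes jsoup 1.14.1 or later, where Safelist replaces Whitelist (on older versions, substitute Whitelist). It also shows the Jsoup.clean overload that accepts a Document.OutputSettings, which should give the same result without a custom method.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist;

class CleanWithoutPrettyPrint {

    // Hypothetical helper: same steps as Jsoup.clean(), but with
    // pretty-printing disabled on the cleaned document before serializing.
    static String cleanNoPrettyPrint(String bodyHtml, Safelist safelist) {
        Document dirty = Jsoup.parseBodyFragment(bodyHtml, "");
        Cleaner cleaner = new Cleaner(safelist);
        Document clean = cleaner.clean(dirty);
        clean.outputSettings().prettyPrint(false);
        return clean.body().html();
    }

    // Alternative: let jsoup apply the output settings itself via the
    // four-argument clean() overload.
    static String cleanWithOutputSettings(String bodyHtml, Safelist safelist) {
        Document.OutputSettings settings = new Document.OutputSettings().prettyPrint(false);
        return Jsoup.clean(bodyHtml, "", safelist, settings);
    }

    public static void main(String[] args) {
        String htmlToClean = "This is a line with <b>bold text</b> within it.";
        System.out.println(cleanNoPrettyPrint(htmlToClean, Safelist.relaxed()));
        System.out.println(cleanWithOutputSettings(htmlToClean, Safelist.relaxed()));
    }
}
```

Both calls should return the string without the extra line break before the <b> tag. If the overload is available in your jsoup version, it is probably the simpler option, since no jsoup internals need to be copied.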