I'm trying to convert an HTML String to a DOM, make some DOM-level changes, and convert it back to a String. The HTML is in French, and characters such as é come back as escaped HTML character references instead of the literal characters.
In the code below, modifiedContent is the converted String after the transformation.
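import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

// doc is the DOM built from the HTML string
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
DOMSource source = new DOMSource(doc);
StringWriter writer = new StringWriter();
StreamResult result = new StreamResult(writer);
transformer.transform(source, result);
// Serialized markup; this is where the accented characters come back escaped
String modifiedContent = writer.toString();
For example, "Résultats de recherche" is the original string; after the DOM is converted back to a String, the é in it comes back as an escaped HTML character reference instead of the literal character.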
I'm feeding this to an FOP processor to convert it to a PDF, so I need the characters in their original form.
It looks normal to me that DOMSource keeps the characters in HTML-escaped form.
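As a side note, the escaping most likely happens when the Transformer serializes the DOM rather than in DOMSource itself: when the root element is html, the identity transformer typically picks the "html" output method, and that serializer tends to write non-ASCII characters as entities. Below is a minimal sketch of asking for the "xml" output method with UTF-8 instead, assuming the DOM can be written as XML. The class and method names (DomToStringSketch, serializeAsXml, sampleDocument) are hypothetical and not from the original post; only the javax.xml.transform calls are standard JAXP.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper class; the names are illustrative only.
final class DomToStringSketch {

    // Serializes a DOM node using the "xml" output method and UTF-8,
    // instead of relying on the transformer's default output method.
    static String serializeAsXml(Node node) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(node), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize the DOM", e);
        }
    }

    // Builds a tiny stand-in for the parsed HTML document (assumed shape).
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            // \u00e9 is the letter e with an acute accent
            body.setTextContent("R\u00e9sultats de recherche");
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build the sample document", e);
        }
    }
}
```

Calling DomToStringSketch.serializeAsXml(doc) in place of the transform code from the question should leave the accented characters as they are. One trade-off: with the "xml" method, empty elements such as br are written in self-closed form, which an XML consumer usually expects but a plain HTML consumer may not.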
You can probably use the unescape HTML method from the Apache Commons Lang library (formerly Jakarta Commons Lang) to convert the HTML entities back to regular characters. In your case, you would just add this line:
String unescapedHtml = StringEscapeUtils.unescapeHtml4(modifiedContent);
Make sure you add the proper Maven dependency to your project.
P.S. There seems to be a newer version of the library on Maven Central, but I could not find the associated Javadoc.
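One caveat with unescaping the whole serialized document: unescapeHtml4 also turns references such as &lt; and &amp; into literal characters, which could break the markup if the text contains them. If that is a concern, or if adding a dependency is not an option, a more conservative variant can be sketched with the standard library only. This is a rough sketch, not a replacement for the library: the class name, the method name and the small table of named entities are all assumptions chosen for illustration, and it only decodes references to non-ASCII characters.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names and the entity table are illustrative only.
final class EntityUnescapeSketch {

    // Small, incomplete subset of named entities that are common in French text.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("eacute", "\u00e9");
        NAMED.put("egrave", "\u00e8");
        NAMED.put("ecirc", "\u00ea");
        NAMED.put("agrave", "\u00e0");
        NAMED.put("acirc", "\u00e2");
        NAMED.put("ccedil", "\u00e7");
        NAMED.put("ocirc", "\u00f4");
        NAMED.put("ugrave", "\u00f9");
        NAMED.put("nbsp", "\u00a0");
    }

    // Matches decimal (&#233;), hexadecimal (&#xE9;) and named (&eacute;) references.
    private static final Pattern REFERENCE =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    // Decodes references to non-ASCII characters and leaves everything else,
    // including &lt;, &gt;, &amp; and unknown names, exactly as it was.
    static String unescapeNonMarkup(String markup) {
        Matcher matcher = REFERENCE.matcher(markup);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String decoded = decode(matcher.group(1));
            String replacement = decoded != null ? decoded : matcher.group();
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    // Returns the decoded character, or null when the reference should be kept.
    private static String decode(String ref) {
        if (ref.charAt(0) != '#') {
            return NAMED.get(ref); // null for names outside the table
        }
        boolean hex = ref.charAt(1) == 'x' || ref.charAt(1) == 'X';
        int codePoint;
        try {
            codePoint = hex
                    ? Integer.parseInt(ref.substring(2), 16)
                    : Integer.parseInt(ref.substring(1));
        } catch (NumberFormatException e) {
            return null; // number too large to be a code point
        }
        // Keep ASCII references (this covers <, >, &, quotes) and invalid values.
        if (codePoint < 0x80 || !Character.isValidCodePoint(codePoint)) {
            return null;
        }
        return new String(Character.toChars(codePoint));
    }
}
```

It would be used the same way as the library call, for example String cleaned = EntityUnescapeSketch.unescapeNonMarkup(modifiedContent);. For anything beyond a handful of entities, the library's full entity table is the safer choice.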