
é shown as &eacute; after DOM conversion in Java

Tags:

java

dom

I'm trying to convert an HTML String to a DOM, make some DOM-level changes, and convert it back to a String. The HTML is in French, and characters such as é are shown as &eacute; in the converted String after the transformation.

// doc is the org.w3c.dom.Document built from the HTML String
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
DOMSource source = new DOMSource(doc);

// Serialize the DOM back into a String
StringWriter writer = new StringWriter();
StreamResult result = new StreamResult(writer);
transformer.transform(source, result);
String modifiedContent = writer.toString();

For example, the input string "Résultats de recherche" comes out as "R&eacute;sultats de recherche" after the DOM is converted back to a String.

I'm feeding this to an FOP processor to convert it to a PDF, so I need the characters in their original form.

stackMan10 asked May 07 '15 07:05


1 Answer

It looks normal to me that the transformed output keeps these characters in HTML entity form.

You can probably use the unescapeHtml4 method of StringEscapeUtils from Apache Commons Lang (formerly a Jakarta project) to convert the HTML entities back to regular characters. In your case, you should just add this line:

String unescapedHtml = StringEscapeUtils.unescapeHtml4(modifiedContent);

Make sure you add the proper Maven dependency (commons-lang3, which provides org.apache.commons.lang3.StringEscapeUtils) to your project.

P.S. There seems to be a newer version of the library on Maven Central, but I could not find the associated Javadoc.
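
An alternative that may avoid the unescape pass altogether is sketched below. It assumes the entities come from the transformer choosing the HTML output method, which JAXP transformers typically do when the root element is html; asking for XML output in UTF-8 should then leave é as a literal character. The class and method names (DomSerializer, serialize, sampleDocument) are hypothetical, and the exact behaviour depends on the transformer implementation in use. One reason to prefer this route is that unescapeHtml4 also decodes &lt; and &amp;, which can leave markup that is no longer well-formed for FOP.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of the original question or answer.
class DomSerializer {

    // Serializes a DOM to a String using the XML output method,
    // so non-ASCII characters are not replaced by named HTML entities.
    static String serialize(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Assumption: the default (HTML) output method is what emits &eacute;
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize the DOM", e);
        }
    }

    // Builds a tiny html document containing the French text from the question.
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            // \u00e9 is é, written as an escape to be independent of source encoding
            body.setTextContent("R\u00e9sultats de recherche");
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build the sample DOM", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(sampleDocument()));
    }
}
```

Because the output is XML rather than HTML, empty elements are written in self-closing form, which is usually what an XSL-FO pipeline expects anyway.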

Arnaud Potier answered Nov 04 '22 13:11