é shown as &eacute; after dom conversion in java

Question

I'm trying to convert an HTML String to a DOM, make some DOM-level changes, and convert it back to a String. The HTML is in French, and characters such as é are shown as &eacute; in the converted String after the transformation.

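import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

// Serialize the modified DOM (doc) back to a String with an identity transform.
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
DOMSource source = new DOMSource(doc);

StringWriter writer = new StringWriter();
StreamResult result = new StreamResult(writer);
transformer.transform(source, result);
String modifiedContent = writer.toString();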

For example, the input contains the string "Résultats de recherche"; after the DOM is converted back to a String, the result is "R&eacute;sultats de recherche".

I'm feeding this to an FOP processor to convert it to a PDF, so I need the characters in their original form.

Arnaud Potier · Accepted Answer

It looks normal to me that the String produced from the DOMSource keeps the characters in HTML entity form.

You can probably use the unescapeHtml4 method of StringEscapeUtils from Apache Commons Lang (formerly Jakarta Commons Lang) to convert the HTML entities back to regular characters. In your case, you should just add this line:

String unescapedHtml = StringEscapeUtils.unescapeHtml4(modifiedContent);

Make sure you add the proper Maven dependency (org.apache.commons:commons-lang3) to your project.
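To illustrate what the unescape step does without pulling in the library, here is a minimal stdlib-only stand-in. The class name `EntityDecoder` and its small entity table are hypothetical and cover only a few French characters; they are not part of Commons Lang, which handles the full HTML 4 entity set. This sketch leaves `&amp;`, `&lt;` and `&gt;` untouched, on the assumption that the markup must stay well-formed for FOP.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: a tiny stand-in for an HTML unescape call,
// limited to a handful of French accented characters.
public class EntityDecoder {
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&eacute;", "\u00e9"); // é
        ENTITIES.put("&egrave;", "\u00e8"); // è
        ENTITIES.put("&ecirc;", "\u00ea");  // ê
        ENTITIES.put("&agrave;", "\u00e0"); // à
        ENTITIES.put("&ccedil;", "\u00e7"); // ç
        ENTITIES.put("&ocirc;", "\u00f4");  // ô
    }

    // Replaces each known named entity with its character.
    // Markup-significant entities (&amp;, &lt;, &gt;) are not in the table,
    // so they are left as they are.
    public static String decode(String html) {
        String out = html;
        for (Map.Entry<String, String> entry : ENTITIES.entrySet()) {
            out = out.replace(entry.getKey(), entry.getValue());
        }
        return out;
    }
}
```

Usage would look like `String unescaped = EntityDecoder.decode(modifiedContent);`, in the same place as the library call above.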

P.S. There seems to be a newer version of the library on Maven Central, but I could not find the associated Javadoc.
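As a possible alternative to post-processing the String, the Transformer can be asked to serialize as XML in UTF-8, so the accented characters may never be turned into entities in the first place. This is a sketch under assumptions, not a confirmed diagnosis: it assumes the entities come from the serializer choosing the "html" output method for a document whose root element is html, and the class name `DomToXmlString` and the sample document are made up for illustration.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: serializes a DOM with explicit output properties.
public class DomToXmlString {

    // Serializes the document as XML text in UTF-8 without an XML declaration.
    // Checked exceptions are wrapped so callers do not have to declare them.
    public static String serialize(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("DOM serialization failed", e);
        }
    }

    // Builds a small html/body/p document containing the string from the question.
    public static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            Element p = doc.createElement("p");
            p.setTextContent("R\u00e9sultats de recherche"); // "Résultats de recherche"
            body.appendChild(p);
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build sample document", e);
        }
    }
}
```

Calling `DomToXmlString.serialize(doc)` in place of the original transform code would be the intended usage. Since FOP consumes XML, the "xml" output method seems a reasonable fit here, but behaviour can vary between Transformer implementations, so the result is worth checking with your own setup.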

Tags: java, dom

stackMan10
