I have some strings that contain XHTML character entities:
"They&apos;re quite varied"
"Sometimes the string &isin; XML standard, sometimes &isin; HTML4 standard"
"Therefore -&gt; I need an XHTML entity decoder."
"Sadly, some strings are not valid XML &amp; are not-quite-so-valid HTML &lt;- but I want them to work, too."
Is there any easy way to decode the entities? (I'm using Java)
I'm currently using StringEscapeUtils.unescapeHtml4(myString.replace("&apos;", "\'")) as a temporary hack. Sadly, org.apache.commons.lang3.StringEscapeUtils has unescapeHtml4 and unescapeXml, but no unescapeXhtml.
EDIT: I do want to handle invalid XML; for example, I want "&amp;&xyzzy;" to decode to "&&xyzzy;".
EDIT: I think HTML5 has almost the same character entities as XHTML, so an HTML5 decoder would be fine too.
This may not be directly relevant, but you may wish to adopt JSoup, which handles things like that, albeit from a higher level. It also includes web page cleaning routines.
Have you tried implementing an XHTMLStringEscapeUtils based on the facilities provided by org.apache.commons.text.StringEscapeUtils?
    import org.apache.commons.text.StringEscapeUtils;
    import org.apache.commons.text.translate.*;

    public class XHTMLStringEscapeUtils {

        // HTML 4 named entities first, then the XML 1.1 rules (which also cover &apos;).
        public static final CharSequenceTranslator ESCAPE_XHTML =
            new AggregateTranslator(
                new LookupTranslator(EntityArrays.BASIC_ESCAPE),
                new LookupTranslator(EntityArrays.ISO8859_1_ESCAPE),
                new LookupTranslator(EntityArrays.HTML40_EXTENDED_ESCAPE)
            ).with(StringEscapeUtils.ESCAPE_XML11);

        // HTML 4 named entities, numeric references, and the XML-only &apos;.
        // Unknown entities are not matched by any translator and pass through unchanged.
        public static final CharSequenceTranslator UNESCAPE_XHTML =
            new AggregateTranslator(
                new LookupTranslator(EntityArrays.BASIC_UNESCAPE),
                new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
                new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
                new NumericEntityUnescaper(),
                new LookupTranslator(EntityArrays.APOS_UNESCAPE)
            );

        public static final String escape(final String input) {
            return ESCAPE_XHTML.translate(input);
        }

        public static final String unescape(final String input) {
            return UNESCAPE_XHTML.translate(input);
        }
    }
Thanks to the modular design of the Apache commons-text library, it's easy to create custom escape utilities.
You can find a full project with tests here xhtml-string-escape-utils
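If pulling in a library is not an option, the lenient behaviour the question asks for (decode what is recognised, leave everything else alone) can be sketched with the JDK alone. This is only a rough illustration, not how commons-text does it: the class name LenientEntityDecoder and its deliberately tiny entity table are my own assumptions, and a real decoder would need the full XHTML/HTML entity list. It assumes Java 9+ for Map.of.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of commons-text or any library.
public class LenientEntityDecoder {

    // Tiny sample table (assumption): extend it with the full entity list as needed.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"",
            "apos", "'", "nbsp", "\u00A0", "isin", "\u2208");

    // Matches &name; as well as decimal (&#8712;) and hex (&#x2208;) references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    public static String decode(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            String replacement = resolve(m.group(1));
            // Unknown or invalid entities are copied through unchanged.
            out.append(replacement != null ? replacement : m.group());
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    // Returns the decoded text, or null if the entity is not recognised.
    private static String resolve(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body);
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        try {
            int codePoint = Integer.parseInt(body.substring(hex ? 2 : 1), hex ? 16 : 10);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : null;
        } catch (NumberFormatException e) {
            return null; // number too large to be a code point
        }
    }
}
```

Because the input is scanned in a single pass, the output of one replacement is never decoded again, so LenientEntityDecoder.decode("&amp;&xyzzy;") gives "&&xyzzy;" and a stray "&" without a matching ";" is left untouched.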