
How to decode XHTML and/or HTML5 entities in Java?

I have some strings that contain XHTML character entities:

"They&apos;re quite varied"
"Sometimes the string &isin; XML standard, sometimes &isin; HTML4 standard"
"Therefore -&gt; I need an XHTML entity decoder."
"Sadly, some strings are not valid XML & are not-quite-so-valid HTML <- but I want them to work, too."

Is there any easy way to decode the entities? (I'm using Java)

I'm currently using StringEscapeUtils.unescapeHtml4(myString.replace("&apos;", "\'")) as a temporary hack. Sadly, org.apache.commons.lang3.StringEscapeUtils has unescapeHtml4 and unescapeXml, but no unescapeXhtml.

EDIT: I do want to handle invalid XML; for example, I want "&&xyzzy;" to decode to "&&xyzzy;".

EDIT: I think HTML5 has almost the same character entities as XHTML, so I think an HTML5 decoder would be fine too.

Karol S asked Feb 19 '14 14:02


2 Answers

This may not be directly relevant, but you may wish to adopt jsoup, which handles things like this, albeit at a higher level. It also includes routines for cleaning web pages.

jmkgreen answered Sep 23 '22 21:09


Have you tried implementing an XHTMLStringEscapeUtils based on the facilities provided by org.apache.commons.text.StringEscapeUtils?

import org.apache.commons.text.StringEscapeUtils;
import org.apache.commons.text.translate.*;

public class XHTMLStringEscapeUtils {
    // Escaper: tries the basic XML characters and the HTML 4 named entities
    // first, then falls back to the XML 1.1 escaper.
    public static final CharSequenceTranslator ESCAPE_XHTML =
            new AggregateTranslator(
                    new LookupTranslator(EntityArrays.BASIC_ESCAPE),
                    new LookupTranslator(EntityArrays.ISO8859_1_ESCAPE),
                    new LookupTranslator(EntityArrays.HTML40_EXTENDED_ESCAPE)
            ).with(StringEscapeUtils.ESCAPE_XML11);

    // Unescaper: the HTML 4 named entities, numeric references (&#65;, &#x41;),
    // and &apos;, which XML defines but HTML 4 does not.
    public static final CharSequenceTranslator UNESCAPE_XHTML =
            new AggregateTranslator(
                    new LookupTranslator(EntityArrays.BASIC_UNESCAPE),
                    new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
                    new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
                    new NumericEntityUnescaper(),
                    new LookupTranslator(EntityArrays.APOS_UNESCAPE)
            );

    // Returns the input with special characters replaced by entities.
    public static final String escape(final String input) {
        return ESCAPE_XHTML.translate(input);
    }

    // Returns the input with recognised entities replaced by their characters.
    public static final String unescape(final String input) {
        return UNESCAPE_XHTML.translate(input);
    }
}

Thanks to the modular design of the Apache Commons Text library, it's easy to create custom escape utilities.

You can find a full project with tests here: xhtml-string-escape-utils
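For readers who want to see the idea without the library, here is a rough, dependency-free sketch of the same "try a named lookup, then a numeric reference, otherwise leave the text alone" strategy. It is not taken from the project above and is not how Commons Text is implemented. The class name LenientEntityDecoder and its tiny entity table are hypothetical, and a real decoder would need the full XHTML/HTML5 entity list.

```java
import java.util.Map;

// Hypothetical sketch: a lenient entity decoder using only the standard library.
final class LenientEntityDecoder {
    // Assumption: a deliberately tiny table for illustration, not the full entity list.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"",
            "apos", "'", "nbsp", "\u00A0", "isin", "\u2208");

    // Replaces recognised entities; anything unrecognised is copied through unchanged.
    static String decode(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '&') {
                int semi = input.indexOf(';', i + 1);
                if (semi > i + 1) {
                    String decoded = resolve(input.substring(i + 1, semi));
                    if (decoded != null) {
                        out.append(decoded);
                        i = semi + 1;
                        continue;
                    }
                }
            }
            out.append(c); // not an entity we know: keep the character as-is
            i++;
        }
        return out.toString();
    }

    // Returns the replacement for an entity body such as "amp", "#65" or "#x41",
    // or null if the body is not a recognised entity.
    private static String resolve(String body) {
        String named = NAMED.get(body);
        if (named != null) {
            return named;
        }
        if (body.length() > 1 && body.charAt(0) == '#') {
            boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
            int radix = hex ? 16 : 10;
            String digits = body.substring(hex ? 2 : 1);
            // Reject empty bodies and leading signs, which parseInt would otherwise accept.
            if (digits.isEmpty() || Character.digit(digits.charAt(0), radix) < 0) {
                return null;
            }
            try {
                int codePoint = Integer.parseInt(digits, radix);
                if (Character.isValidCodePoint(codePoint)) {
                    return new String(Character.toChars(codePoint));
                }
            } catch (NumberFormatException e) {
                // Malformed or overflowing number: treat as plain text.
            }
        }
        return null;
    }
}
```

With this sketch, LenientEntityDecoder.decode("They&apos;re &&xyzzy;") returns "They're &&xyzzy;", so the unknown entity from the question passes through untouched. The Commons Text version above remains the better choice in practice, since it ships the complete HTML 4 entity tables.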

ehe888 answered Sep 19 '22 21:09