StringEscapeUtils.escapeXml is converting utf8 characters which it should not

The escapeXml function converts ѭ Ѯ to &#1133; &#1134;, which I don't think it should. From what I have read, it supports only the five basic XML entities (gt, lt, quot, amp, apos).

Is there a function that converts only these five basic XML entities?

Mady asked Jan 24 '12 10:01



4 Answers

public String escapeXml(String s) {
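    // Replace '&' first so that ampersands introduced by the other entities are not escaped again.
    // String.replace is enough here: none of the search strings needs a regular expression.
    return s.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;").replace("\"", "&quot;").replace("'", "&apos;");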
}
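The chained replacements above depend on `&` being handled first. A minimal sketch of why the order matters, using only the JDK (the class and method names are made up for illustration):

```java
final class ReplaceOrderDemo {

    // '&' is escaped before '<', so "&lt;" is left alone afterwards.
    static String ampersandFirst(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;");
    }

    // '&' is escaped last, so the '&' of the freshly inserted "&lt;" is escaped a second time.
    static String ampersandLast(String s) {
        return s.replace("<", "&lt;").replace("&", "&amp;");
    }
}
```

For the input a<b&c, ampersandFirst gives a&lt;b&amp;c, while ampersandLast gives the double-escaped a&amp;lt;b&amp;c.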
Bombe answered Oct 03 '22 02:10


The javadoc for the 3.1 version of the library says:

Note that Unicode characters greater than 0x7f are as of 3.0, no longer escaped. If you still wish this functionality, you can achieve it via the following: StringEscapeUtils.ESCAPE_XML.with( NumericEntityEscaper.between(0x7f, Integer.MAX_VALUE) );

So you are probably using an older version of the library. Update your dependencies (or reimplement the escaping yourself; it's not rocket science).
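For anyone taking the "reimplement it yourself" route, here is a minimal sketch using only the JDK. The class name XmlEscapeSketch and the escapeNonAscii flag are made up for illustration and are not part of Commons Lang; the flag only approximates the behaviour the question describes, where code points above 0x7f come out as decimal numeric entities.

```java
final class XmlEscapeSketch {

    /**
     * Escapes the five predefined XML entities in a single pass.
     * If escapeNonAscii is true, code points above 0x7f are also written
     * as decimal numeric entities; otherwise they are copied unchanged.
     */
    static String escape(String s, boolean escapeNonAscii) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); ) {
            // Iterate by code point so surrogate pairs are treated as one character.
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:
                    if (escapeNonAscii && cp > 0x7f) {
                        out.append("&#").append(cp).append(';');
                    } else {
                        out.appendCodePoint(cp);
                    }
            }
        }
        return out.toString();
    }
}
```

With the flag off, ѭ Ѯ passes through unchanged; with it on, the same input becomes &#1133; &#1134;.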

JB Nizet answered Oct 03 '22 00:10


The javadoc of StringEscapeUtils.escapeXml says that we have to use

StringEscapeUtils.ESCAPE_XML.with( new UnicodeEscaper(Range.between(0x7f, Integer.MAX_VALUE)) );

But NumericEntityEscaper has to be used instead of UnicodeEscaper. UnicodeEscaper turns everything into \u1234-style escapes, whereas NumericEntityEscaper produces numeric entities such as &#123;, which is what was expected.

package mypackage;

import org.apache.commons.lang3.StringEscapeUtils;
import org.apache.commons.lang3.text.translate.CharSequenceTranslator;
import org.apache.commons.lang3.text.translate.NumericEntityEscaper;

public class XmlEscaper {
    public static void main(final String[] args) {
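        // the spaces between the underscores are assumed to be non-breaking spaces (U+00A0),
        // written as escapes here so they survive copy and paste
        final String xmlToEscape = "<hello>Hi</hello>" + "_\u00A0_" + "__\u00A0__" + "___\u00A0___" + "after &nbsp;";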

        // no Unicode escape
        final String escapedXml = StringEscapeUtils.escapeXml(xmlToEscape);

        // escape Unicode as numeric codes. For instance, escape non-breaking space as &#160;
        final CharSequenceTranslator translator = StringEscapeUtils.ESCAPE_XML.with( NumericEntityEscaper.between(0x7f, Integer.MAX_VALUE) );
        final String escapedXmlWithUnicode = translator.translate(xmlToEscape);

        System.out.println("xmlToEscape: " + xmlToEscape);
        System.out.println("escapedXml: " + escapedXml); // does not escape Unicode characters like non-breaking space
        System.out.println("escapedXml with unicode: " + escapedXmlWithUnicode); // escapes Unicode characters
    }
}
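To make the difference between the two output styles concrete, here is a small JDK-only sketch. It does not call Commons Lang; javaStyle and xmlNumeric are hypothetical helpers that only imitate the shape of the output described above.

```java
final class EscapeFormats {

    // Java-source style: backslash, 'u', then four upper-case hex digits of the UTF-16 unit.
    static String javaStyle(char c) {
        return String.format("\\u%04X", (int) c);
    }

    // XML style: decimal numeric character reference for the code point.
    static String xmlNumeric(int codePoint) {
        return "&#" + codePoint + ";";
    }
}
```

For the non-breaking space (U+00A0), javaStyle yields the six characters \u00A0, which mean nothing to an XML parser, while xmlNumeric yields &#160;, which an XML parser reads back as the original character.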
Dmitriy Popov answered Oct 03 '22 01:10


In times of UTF-8, it is sometimes preferable for XML documents to keep non-ASCII characters readable instead of escaping them. This should work, and the String is rebuilt only once.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static final Pattern ESCAPE_XML_CHARS = Pattern.compile("[\"&'<>]");

public static String escapeXml(String s) {
    Matcher m = ESCAPE_XML_CHARS.matcher(s);
    StringBuffer buf = new StringBuffer();
    while (m.find()) {
        switch (m.group().codePointAt(0)) {
            case '"':
                m.appendReplacement(buf, "&quot;");
                break;
            case '&':
                m.appendReplacement(buf, "&amp;");
                break;
            case '\'':
                m.appendReplacement(buf, "&apos;");
                break;
            case '<':
                m.appendReplacement(buf, "&lt;");
                break;
            case '>':
                m.appendReplacement(buf, "&gt;");
                break;
        }
    }
    m.appendTail(buf);
    return buf.toString();
}
Matthias Ronge answered Oct 03 '22 00:10