Java

Question

I'm looking to convert an HTML block that contains named HTML entities into an XML-compliant block that uses numbered XML entities, while leaving all HTML tags in place.

This is the basic idea illustrated via test:

@Test
public void testEvalHtmlEntitiesToXmlEntities() {
    String input = "<a href=\"test.html\">link&nbsp;</a>";
    String expected = "<a href=\"test.html\">link&#160;</a>";
    String actual = SomeUtil.eval(input);
    Assert.assertEquals(expected, actual);
}

Is anyone aware of a class that provides this functionality? I can write a regex to iterate through non-element matches and do:

xmlString += StringEscapeUtils.escapeXml(StringEscapeUtils.unescapeHtml(htmlString));

but I hoped there is an easier way or a class that already provides this.

Dave Maple · Accepted Answer

This is what I wound up using. Seems to work fine:

/**
 * Some helper methods for HTML => XHTML manipulation
 * 
 * @author David Maple<d@davemaple.com>
 *
 */
public class XhtmlUtil {

    // Matches "&", then one or more non-whitespace characters (reluctantly), then ";"
    private static final Pattern ENTITY_PATTERN = Pattern.compile("(&[^\\s]+?;)");

    /**
     * Don't instantiate me
     */
    private XhtmlUtil() { } 

    /**
     * Convert a String of HTML with named HTML entities to the 
     * same String with entities converted to numbered XML entities 
     * 
     * @param html
     * @return xhtml
     */
    public static String htmlToXmlEntities(String html) {
        StringBuffer stringBuffer = new StringBuffer();
        Matcher matcher = ENTITY_PATTERN.matcher(html);

        while (matcher.find()) {
            String replacement = htmlEntityToXmlEntity(matcher.group(1));
            matcher.appendReplacement(stringBuffer, "");
            stringBuffer.append(replacement);
        }

        matcher.appendTail(stringBuffer);
        return stringBuffer.toString();
    }

    /**
     * Replace an HTML entity with an XML entity
     * 
     * @param htmlEntity
     * @return xmlEntity
     */
    private static String htmlEntityToXmlEntity(String htmlEntity) {
        return StringEscapeUtils.escapeXml(StringEscapeUtils.unescapeHtml(htmlEntity));
    }

}

and the corresponding tests:

public class XhtmlUtilTest {

    @Test
    public void testEvalXmlEscape() {
        String input = "link 1 &nbsp;|&nbsp; link2 &amp; & dkdk;";
        String expected = "link 1 &#160;|&#160; link2 &amp; & dkdk;";
        String actual = XhtmlUtil.htmlToXmlEntities(input);
        System.out.println(actual);
        Assert.assertEquals(expected, actual);
    }

    @Test
    public void testEvalXmlEscape2() {
        String input = "<a href=\"test.html\">link&nbsp;</a>";
        String expected = "<a href=\"test.html\">link&#160;</a>";
        String actual = XhtmlUtil.htmlToXmlEntities(input);
        System.out.println(actual);
        Assert.assertEquals(expected, actual);
    }

    @Test
    public void testEvalXmlEscapeMultiLine() {
        String input = "<a href=\"test.html\">link&nbsp;</a>\n<a href=\"test.html\">link&nbsp;</a>";
        String expected = "<a href=\"test.html\">link&#160;</a>\n<a href=\"test.html\">link&#160;</a>";
        String actual = XhtmlUtil.htmlToXmlEntities(input);
        System.out.println(actual);
        Assert.assertEquals(expected, actual);
    }

}

Paul Vargas · Answer

Have you tried with JTidy?

private String cleanData(String data) throws UnsupportedEncodingException {
    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    tidy.setPrintBodyOnly(true); // only print the content
    tidy.setXmlOut(true); // to XML
    tidy.setSmartIndent(true); 
    ByteArrayInputStream inputStream = new ByteArrayInputStream(data.getBytes("UTF-8"));
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    tidy.parseDOM(inputStream, outputStream);
    return outputStream.toString("UTF-8");
}

Although I think it will also repair parts of your HTML if it finds anything malformed.

Greg · Answer

Here is another solution that I use:

    /**
     * Converts the specified string which is in ASCII format to legal XML
     * format. Inspired by XMLWriter by http://www.megginson.com/Software/
     */
    public static String convertAsciiToXml(String string) {
        if (string == null || string.equals(""))
            return "";

        StringBuffer sbuf = new StringBuffer();
        char ch[] = string.toCharArray();
        for (int i = 0; i < ch.length; i++) {
            switch (ch[i]) {
                case '&':
                    sbuf.append("&amp;");
                    break;
                case '<':
                    sbuf.append("&lt;");
                    break;
                case '>':
                    sbuf.append("&gt;");
                    break;
                case '\"':
                    sbuf.append("&quot;");
                    break;
                default:
                    if (ch[i] > '\u007f') {
                        sbuf.append("&#");
                        sbuf.append(Integer.toString(ch[i]));
                        sbuf.append(';');
                    }
                    else if (ch[i] == '\t') {
                        sbuf.append(' ');
                        sbuf.append(' ');
                        sbuf.append(' ');
                        sbuf.append(' ');
                    }
                    else if ((int) ch[i] >= 32 || (ch[i] == '\n' || ch[i] == '\r')) {
                        sbuf.append(ch[i]);
                    }
            }
        }
        return sbuf.toString();
    }
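One thing to watch with a char-by-char loop like the one above: a character outside the Basic Multilingual Plane is stored as two surrogate chars, so each half would get its own numeric reference, which is not a valid XML character reference. A minimal sketch of a code-point-based variant follows. The class and method names are hypothetical, and unlike the method above it deliberately leaves `&`, `<`, `>` and `"` alone so that existing markup survives:

```java
public class NonAsciiEscaper {

    /**
     * Replaces every code point above U+007F with a decimal numeric
     * character reference. ASCII characters, including markup characters,
     * are copied through unchanged.
     */
    public static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // codePoints() yields one int per Unicode code point, so a
        // surrogate pair is seen as a single value rather than two chars
        text.codePoints().forEach(cp -> {
            if (cp > 0x7F) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }
}
```

This only covers the non-ASCII branch of the original method; the tab expansion and control-character filtering would still have to be added if you need them.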

Aaron Digulla · Answer

If you already have commons-lang on your classpath, look into the arrays in EntityArrays; they contain the mapping for all the entities.

To get the numeric value, just use codePointAt(0) on the first element (the Unicode character).

Now you need a regex-based loop to search for &[^;]+;. This is pretty safe since & is a special character which needs to be escaped. If you need to be 100% sure, look for CDATA elements and ignore them.
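The loop described above could be sketched roughly as follows. This is only an illustration under some assumptions: the class name and the tiny lookup table are made up for the example (a real table would need every named HTML entity, for instance built from the EntityArrays data), the pattern is a slightly stricter `&name;` form than `&[^;]+;`, and it uses `Map.of` and `Matcher.appendReplacement(StringBuilder, ...)`, so it needs Java 9 or later. CDATA sections are not handled.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedEntityConverter {

    // Illustrative subset only (assumption): entity name -> the character it stands for
    private static final Map<String, String> NAMED = Map.of(
            "nbsp", "\u00A0",
            "copy", "\u00A9",
            "eacute", "\u00E9",
            "euro", "\u20AC");

    // Matches &name; with an alphanumeric name; numeric references such as &#160; do not match
    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    public static String toNumeric(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String ch = NAMED.get(m.group(1));
            // Names missing from the table (including &amp;, &lt;, &gt;, &quot;,
            // which XML predefines) are copied through unchanged
            String replacement = (ch == null) ? m.group() : "&#" + ch.codePointAt(0) + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With this table, `NamedEntityConverter.toNumeric("<a href=\"test.html\">link&nbsp;</a>")` returns `<a href="test.html">link&#160;</a>`, which matches the expectation in the question's test. Keeping the lookup in your own code also means the result does not depend on how a particular library version chooses to escape non-ASCII characters.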

Java - convert named html entities to numbered xml entities

Tags:

html

parsing

xml

entities
