Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Java - convert named html entities to numbered xml entities

I'm looking to convert an html block that contains html named entities to an xml compliant block that uses numbered xml entities while leaving all html tag elements in place.

This is the basic idea illustrated via test:

@Test
public void testEvalHtmlEntitiesToXmlEntities() {
    String input = "<a href=\"test.html\">link&nbsp;</a>";
    String expected = "<a href=\"test.html\">link&#160;</a>";
    String actual = SomeUtil.eval(input);
    Assert.assertEquals(expected, actual);
}

Is anyone aware of a Class that provides this functionality? I can write a regex to iterate through non element matches and do:

xlmString += StringEscapeUtils.escapeXml(StringEscapeUtils.unescapeHtml(htmlString));

but hoped there is an easier way or a Class that already provides this.

like image 814
Dave Maple Avatar asked May 03 '12 20:05

Dave Maple


4 Answers

This is what I wound up using. Seems to work fine:

/**
 * Some helper methods for XHTML => HTML manipulation
 * 
 * @author David Maple<[email protected]>
 *
 */
public class XhtmlUtil {

    private static final Pattern ENTITY_PATTERN = Pattern.compile("(&[^\\s]+?;)");

    /**
     * Don't instantiate me
     */
    private XhtmlUtil() { } 

    /**
     * Convert a String of HTML with named HTML entities to the 
     * same String with entities converted to numbered XML entities 
     * 
     * @param html
     * @return xhtml
     */
    public static String htmlToXmlEntities(String html) {
        StringBuffer stringBuffer = new StringBuffer();
        Matcher matcher = ENTITY_PATTERN.matcher(html);

        while (matcher.find()) {
            String replacement = htmlEntityToXmlEntity(matcher.group(1));
            matcher.appendReplacement(stringBuffer, "");
            stringBuffer.append(replacement);
        }

        matcher.appendTail(stringBuffer);
        return stringBuffer.toString();
    }

    /**
     * Replace an HTML entity with an XML entity
     * 
     * @param htmlEntity
     * @return xmlEntity
     */
    private static String htmlEntityToXmlEntity(String html) {
        return StringEscapeUtils.escapeXml(StringEscapeUtils.unescapeHtml(html));
    }

}

and the corresponding tests:

public class XhtmlUtilTest {

    @Test
    public void testEvalXmlEscape() {
        String input = "link 1 &nbsp;|&nbsp; link2 &amp; & dkdk;";
        String expected = "link 1 &#160;|&#160; link2 &amp; & dkdk;";
        String actual = XhtmlUtil.htmlToXmlEntities(input);
        System.out.println(actual);
        Assert.assertEquals(expected, actual);
    }

    @Test
    public void testEvalXmlEscape2() {
        String input = "<a href=\"test.html\">link&nbsp;</a>";
        String expected = "<a href=\"test.html\">link&#160;</a>";
        String actual = XhtmlUtil.htmlToXmlEntities(input);
        System.out.println(actual);
        Assert.assertEquals(expected, actual);
    }

    @Test
    public void testEvalXmlEscapeMultiLine() {
        String input = "<a href=\"test.html\">link&nbsp;</a>\n<a href=\"test.html\">link&nbsp;</a>";
        String expected = "<a href=\"test.html\">link&#160;</a>\n<a href=\"test.html\">link&#160;</a>";
        String actual = XhtmlUtil.htmlToXmlEntities(input);
        System.out.println(actual);
        Assert.assertEquals(expected, actual);
    }

}
like image 115
Dave Maple Avatar answered Nov 15 '22 15:11

Dave Maple


Have you tried with JTidy?

private String cleanData(String data) throws UnsupportedEncodingException {
    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    tidy.setPrintBodyOnly(true); // only print the content
    tidy.setXmlOut(true); // to XML
    tidy.setSmartIndent(true); 
    ByteArrayInputStream inputStream = new ByteArrayInputStream(data.getBytes("UTF-8"));
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    tidy.parseDOM(inputStream, outputStream);
    return outputStream.toString("UTF-8");
}

Although I think it will repair some of your HTML code in case has something.

like image 29
Paul Vargas Avatar answered Nov 15 '22 16:11

Paul Vargas


Here is another solution that I use

 /**
     * Converts the specified string which is in ASCII format to legal XML
     * format. Inspired by XMLWriter by http://www.megginson.com/Software/
     */
    public static String convertAsciiToXml(String string) {
        if (string == null || string.equals(""))
            return "";

        StringBuffer sbuf = new StringBuffer();
        char ch[] = string.toCharArray();
        for (int i = 0; i < ch.length; i++) {
            switch (ch[i]) {
                case '&':
                    sbuf.append("&amp;");
                    break;
                case '<':
                    sbuf.append("&lt;");
                    break;
                case '>':
                    sbuf.append("&gt;");
                    break;
                case '\"':
                    sbuf.append("&quot;");
                    break;
                default:
                    if (ch[i] > '\u007f') {
                        sbuf.append("&#");
                        sbuf.append(Integer.toString(ch[i]));
                        sbuf.append(';');
                    }
                    else if (ch[i] == '\t') {
                        sbuf.append(' ');
                        sbuf.append(' ');
                        sbuf.append(' ');
                        sbuf.append(' ');
                    }
                    else if ((int) ch[i] >= 32 || (ch[i] == '\n' || ch[i] == '\r')) {
                        sbuf.append(ch[i]);
                    }
            }
        }
        return sbuf.toString();
    }
like image 22
Greg Avatar answered Nov 15 '22 15:11

Greg


If you already have commons-lang on your classpath, look into the arrays in EntityArrays; they contain the mapping for all the entities.

To get the numeric value, just use codePointAt(0) on the first element (the Unicode character).

Now you need a regex-based loop to search for &[^;]+;. This is pretty safe since & is a special character which needs to be escaped. If you need to be 100% sure, look for CDATA elements and ignore them.

like image 45
Aaron Digulla Avatar answered Nov 15 '22 17:11

Aaron Digulla