I'm looking to convert an HTML block that contains named HTML entities into an XML-compliant block that uses numeric XML entities, while leaving all HTML tags in place.
This is the basic idea, illustrated via a test:
@Test
public void testEvalHtmlEntitiesToXmlEntities() {
    String input = "<a href=\"test.html\">link&nbsp;</a>";
    String expected = "<a href=\"test.html\">link&#160;</a>";
    String actual = SomeUtil.eval(input);
    Assert.assertEquals(expected, actual);
}
Is anyone aware of a class that provides this functionality? I could write a regex to iterate through the non-element matches and do:
xmlString += StringEscapeUtils.escapeXml(StringEscapeUtils.unescapeHtml(htmlString));
but I hoped there would be an easier way, or a class that already provides this.
This is what I wound up using. Seems to work fine:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.lang.StringEscapeUtils;

/**
 * Some helper methods for XHTML => HTML manipulation
 *
 * @author David Maple<[email protected]>
 *
 */
public class XhtmlUtil {

    // Candidate entity: '&', one or more non-whitespace characters (as few as possible), then ';'
    private static final Pattern ENTITY_PATTERN = Pattern.compile("(&[^\\s]+?;)");

    /**
     * Don't instantiate me
     */
    private XhtmlUtil() { }

    /**
     * Convert a String of HTML with named HTML entities to the
     * same String with entities converted to numbered XML entities
     *
     * @param html
     * @return xhtml
     */
    public static String htmlToXmlEntities(String html) {
        StringBuffer stringBuffer = new StringBuffer();
        Matcher matcher = ENTITY_PATTERN.matcher(html);
        while (matcher.find()) {
            String replacement = htmlEntityToXmlEntity(matcher.group(1));
            // Copy the text before the match, then append the replacement directly
            // so that '$' and '\' in it are not interpreted by appendReplacement
            matcher.appendReplacement(stringBuffer, "");
            stringBuffer.append(replacement);
        }
        matcher.appendTail(stringBuffer);
        return stringBuffer.toString();
    }

    /**
     * Replace an HTML entity with an XML entity
     *
     * @param htmlEntity
     * @return xmlEntity
     */
    private static String htmlEntityToXmlEntity(String htmlEntity) {
        return StringEscapeUtils.escapeXml(StringEscapeUtils.unescapeHtml(htmlEntity));
    }
}
and the corresponding tests:
import org.junit.Assert;
import org.junit.Test;

public class XhtmlUtilTest {

    @Test
    public void testEvalXmlEscape() {
        String input = "link 1 | link2 & & dkdk;";
        String expected = "link 1  |  link2 & & dkdk;";
        String actual = XhtmlUtil.htmlToXmlEntities(input);
        System.out.println(actual);
        Assert.assertEquals(expected, actual);
    }

    @Test
    public void testEvalXmlEscape2() {
        String input = "<a href=\"test.html\">link&nbsp;</a>";
        String expected = "<a href=\"test.html\">link&#160;</a>";
        String actual = XhtmlUtil.htmlToXmlEntities(input);
        System.out.println(actual);
        Assert.assertEquals(expected, actual);
    }

    @Test
    public void testEvalXmlEscapeMultiLine() {
        String input = "<a href=\"test.html\">link&nbsp;</a>\n<a href=\"test.html\">link&nbsp;</a>";
        String expected = "<a href=\"test.html\">link&#160;</a>\n<a href=\"test.html\">link&#160;</a>";
        String actual = XhtmlUtil.htmlToXmlEntities(input);
        System.out.println(actual);
        Assert.assertEquals(expected, actual);
    }
}
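For anyone curious which substrings the entity regex in XhtmlUtil actually picks up, here is a small stand-alone sketch. The class name EntityPatternDemo and its matches helper are made up for illustration; only the pattern string is taken from the class above. It suggests why a bare "&" followed by whitespace is left untouched: the pattern needs at least one non-whitespace character between "&" and ";".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class EntityPatternDemo {

    // Same pattern string as ENTITY_PATTERN in XhtmlUtil
    static final Pattern ENTITY_PATTERN = Pattern.compile("(&[^\\s]+?;)");

    // Collect every substring the pattern treats as a candidate entity
    static List<String> matches(String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = ENTITY_PATTERN.matcher(input);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }
}
```

Calling EntityPatternDemo.matches("link&nbsp;1 &amp; & dkdk;") should return the two candidates "&nbsp;" and "&amp;" and skip the lone "&" before " dkdk;".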
Have you tried JTidy?
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;

import org.w3c.tidy.Tidy;

private String cleanData(String data) throws UnsupportedEncodingException {
    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    tidy.setPrintBodyOnly(true); // only print the content
    tidy.setXmlOut(true); // to XML
    tidy.setSmartIndent(true);
    ByteArrayInputStream inputStream = new ByteArrayInputStream(data.getBytes("UTF-8"));
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    tidy.parseDOM(inputStream, outputStream);
    return outputStream.toString("UTF-8");
}
Bear in mind that, as far as I know, JTidy will also repair your HTML if it finds anything malformed in it.
Here is another solution that I use:
/**
 * Converts the specified string which is in ASCII format to legal XML
 * format. Inspired by XMLWriter by http://www.megginson.com/Software/
 */
public static String convertAsciiToXml(String string) {
    if (string == null || string.equals(""))
        return "";
    StringBuffer sbuf = new StringBuffer();
    char[] ch = string.toCharArray();
    for (int i = 0; i < ch.length; i++) {
        switch (ch[i]) {
        // The four characters below are written as predefined XML entities
        case '&':
            sbuf.append("&amp;");
            break;
        case '<':
            sbuf.append("&lt;");
            break;
        case '>':
            sbuf.append("&gt;");
            break;
        case '"':
            sbuf.append("&quot;");
            break;
        default:
            if (ch[i] > '\u007f') {
                // Non-ASCII: emit a decimal numeric character reference
                sbuf.append("&#");
                sbuf.append(Integer.toString(ch[i]));
                sbuf.append(';');
            }
            else if (ch[i] == '\t') {
                // Expand a tab to four spaces
                sbuf.append("    ");
            }
            else if ((int) ch[i] >= 32 || (ch[i] == '\n' || ch[i] == '\r')) {
                // Printable ASCII and line breaks pass through; other control characters are dropped
                sbuf.append(ch[i]);
            }
        }
    }
    return sbuf.toString();
}
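One caveat with a char-by-char loop like the one above: characters outside the Basic Multilingual Plane are stored in a Java String as two chars (a surrogate pair), so each half would get its own numeric reference, and a reference to a lone surrogate is not well-formed XML. A minimal sketch of iterating by code point instead follows. The class and method names are hypothetical, and it covers only the non-ASCII branch, not the handling of &, <, >, quotes or tabs.

```java
// Hypothetical helper illustrating code-point iteration; not from the original answer.
final class NumericEscaper {

    // Replace every non-ASCII code point with a decimal numeric character reference
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            if (codePoint > 0x7F) {
                sb.append("&#").append(codePoint).append(';');
            } else {
                sb.appendCodePoint(codePoint);
            }
            // Advance by 1 char for BMP characters, 2 for a surrogate pair
            i += Character.charCount(codePoint);
        }
        return sb.toString();
    }
}
```

With this, a supplementary character such as U+1F600 comes out as one reference rather than two surrogate halves.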
If you already have commons-lang on your classpath, look into the arrays in EntityArrays; they contain the mapping for all the entities.
To get the numeric value, just use codePointAt(0) on the first element (the Unicode character).
Now you need a regex-based loop to search for &[^;]+;. This is pretty safe since & is a special character which needs to be escaped. If you need to be 100% sure, look for CDATA elements and ignore them.
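The steps above could be sketched roughly as follows. To keep the example self-contained, a tiny hand-written map stands in for the commons-lang EntityArrays tables (that substitution is an assumption for illustration, as are the class and method names), and the regex is a slightly stricter variant of &[^;]+; that also refuses whitespace and a second & inside the name. Anything not found in the map, such as &amp; or an existing numeric reference, is left as it is.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; a real version would load its table from EntityArrays.
final class NamedToNumericEntities {

    // Stand-in table: entity name -> the character it represents
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("copy", "\u00A9");
        NAMED.put("eacute", "\u00E9");
        NAMED.put("euro", "\u20AC");
    }

    // '&', a name without ';', '&' or whitespace, then ';'
    private static final Pattern ENTITY = Pattern.compile("&([^;&\\s]+);");

    static String convert(String html) {
        Matcher matcher = ENTITY.matcher(html);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String character = NAMED.get(matcher.group(1));
            // Unknown names are copied through unchanged
            String replacement = (character == null)
                    ? matcher.group()
                    : "&#" + character.codePointAt(0) + ";";
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

For example, NamedToNumericEntities.convert("link&nbsp;</a>") should give "link&#160;</a>", while tags and &amp; pass through untouched. CDATA sections are not handled in this sketch.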