Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can i escape special characters with using DOM

This issue has been bugging me a lot lately and i can't seem to find out a possible solution.

I am dealing with a web-server that receives an XML document to do some processing. The server's parser has issues with &,',",<,>. I know this is bad, i didn't implement the xml parser on that server. But before waiting for a patch i need to circumvent.

Now, before uploading my XML document to this server, i need to parse it and escape the xml special characters. I am currently using DOM. The issue is, if i iterate through the TEXT_NODES and replaces all the special characters with their escaped versions, when I save this document,

for d'ex i get d&amp;apos;ex but i need d&apos;ex

It makes sense since, DOM escapes "&". But obviously this is not what i need.

So if DOM is already capable of escaping "&" to "&amp;" how can i make it escape other characters like " to &quot; ?

If it can't, how can i save the already parsed and escaped texts in it's nodes without it having to re-escape them when saving ?

This is how i escape the special characters i used apache StringEscapeUtils class:

public String xMLTransform() throws Exception
      {

         String xmlfile = FileUtils.readFileToString(new File(filepath));

         DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
         DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
         Document doc = docBuilder.parse(new InputSource(new StringReader(xmlfile.trim().replaceFirst("^([\\W]+)<", "<"))));

       NodeList nodeList = doc.getElementsByTagName("*");

       for (int i = 0; i < nodeList.getLength(); i++) {
          Node currentNode = nodeList.item(i);
          if (currentNode.getNodeType() == Node.ELEMENT_NODE) {
              Node child = currentNode.getFirstChild();
              while(child != null) {
                  if (child.getNodeType() == Node.TEXT_NODE) {                   
                    child.setNodeValue(StringEscapeUtils.escapeXml10(child.getNodeValue()));
//Escaping works here. But when saving the final document, the "&" used in escaping gets escaped as well by DOM.


                  }
                  child = child.getNextSibling();
              }
          }
      }

         TransformerFactory transformerFactory = TransformerFactory.newInstance();

       Transformer transformer = transformerFactory.newTransformer();
         DOMSource source = new DOMSource(doc);
         StringWriter writer = new StringWriter();
         StreamResult result = new StreamResult(writer);
         transformer.transform(source, result);


         FileOutputStream fop = null;
         File file;

         file = File.createTempFile("escapedXML"+UUID.randomUUID(), ".xml");

         fop = new FileOutputStream(file);

         String xmlString = writer.toString();
         byte[] contentInBytes = xmlString.getBytes();

         fop.write(contentInBytes);
         fop.flush();
         fop.close();

      return file.getPath();


      }
like image 735
Undisputed007 Avatar asked Jul 20 '16 08:07

Undisputed007


People also ask

How do you bypass special characters?

To search for a special character that has a special function in the query syntax, you must escape the special character by adding a backslash before it, for example: To search for the string "where?", escape the question mark as follows: "where\?"

How do you escape a special character in JavaScript?

To use a special character as a regular one, prepend it with a backslash: \. . That's also called “escaping a character”.

Can I escape HTML special chars in JavaScript?

String − We can pass any HTML string as an argument to escape special characters and encode it.


1 Answers

I think the solution you're looking for is a customized XSLT parser that you can configure for your additional HTML escaping.

I'm not able to say for certain how to configure the xslt file to do what you want, but I am fairly confident it can be done. I've stubbed out the basic Java setup below:

@Test
    public void testXSLTTransforms () throws Exception {
        DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
        Document doc = docBuilder.newDocument();
        Element el = doc.createElement("Container");
        doc.appendChild(el);


        Text e = doc.createTextNode("Character");
        el.appendChild(e);
        //e.setNodeValue("\'");
        //e.setNodeValue("\"");

        e.setNodeValue("&");



        TransformerFactory transformerFactory = TransformerFactory.newInstance();       
        Transformer transformer = transformerFactory.newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");        
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");


        DOMSource source = new DOMSource(doc);
        StreamResult result = new StreamResult(System.out);
        //This prints the original document to the command line.
        transformer.transform(source, result);

        InputStream xsltStream =  getClass().getResourceAsStream("/characterswap.xslt");
            Source xslt = new StreamSource(xsltStream);
            transformer = transformerFactory.newTransformer(xslt);
            //This one is the one you'd pipe to a file
            transformer.transform(source, result);
    }

And I've got a simple XSLT I used for proof of concept that shows the default character encoding you mentioned:

characterswap.xslt

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes" indent="yes"/>
    <xsl:strip-space elements="*"/>

 <xsl:template match="node()|@*">
 <xsl:text> &#xa;  Original VALUE :  </xsl:text>
     <xsl:copy-of select="."/>
     <xsl:text> &#xa;  OUTPUT ESCAPING DISABLED :  </xsl:text>
      <xsl:value-of select="." disable-output-escaping="yes"/>
      <xsl:text> &#xa;  OUTPUT ESCAPING ENABLED :  </xsl:text>
      <xsl:value-of select="." disable-output-escaping="no"/>
 </xsl:template>

</xsl:stylesheet>

And the console out is pretty basic:

<?xml version="1.0" encoding="UTF-8"?>
<Container>&amp;</Container>

  Original VALUE :  <Container>&amp;</Container> 
  OUTPUT ESCAPING DISABLED :  & 
  OUTPUT ESCAPING ENABLED :  &amp;

You can take the active node from the XSLT execution and perform specific character replacments. There are multiple examples I was able to find, but I'm having difficulty getting them working in my context.

XSLT string replace is a good place to start.

This is about the extent of my knowledge with XSLT, I hope it helps you solve your issue.

Best of luck.


I was considering this further, and the solution may not only be XSLT. From your description, I have the impression that rather than xml10 encoding, you're kind of looking for a full set of html encoding.

Along those lines, if we take your current node text transformation:

if (child.getNodeType() == Node.TEXT_NODE) {
    child.setNodeValue(StringEscapeUtils.escapeXml10(child.getNodeValue()));
}

And explicitly expect that we want the HTML Encoding:

if (child.getNodeType() == Node.TEXT_NODE) {
    //Capture the current node value
    String nodeValue = child.getNodeValue();
    //Decode for XML10 to remove existing escapes
    String decodedNode = StringEscapeUtils.unescapeXml10(nodeValue);
    //Then Re-encode for HTML (3/4/5)
    String fullyEncodedHTML = StringEscapeUtils.escapeHtml3(decodedNode);
    //String fullyEncodedHTML = StringEscapeUtils.escapeHtml4(decodedNode);
    //String fullyEncodedHTML = StringEscapeUtils.escapeHtml5(decodedNode);

    //Then place the fully-encoded HTML back to the node
    child.setNodeValue(fullyEncodedHTML);
}

I would think that the xml would now be fully encoded with all of the HTML escapes you were wanting.

Now combine this with the XSLT for output escaping (from above), and the document will not undergo any further transformations when written out to the file.

I like this solution because it limits the logic held in the XSLT file. Rather than managing the entire String find/replace, you would just need to ensure that you copy your entire node and copy the text() with output escaping disabled.

In theory, that seems like it would fulfill my understanding of your objective.

Caveat again is that I'm weak with XSLT, so the example xslt file may still need some tweaking. This solution reduces that unknown work quantity, in my opinion.

like image 55
Jeremiah Avatar answered Oct 16 '22 14:10

Jeremiah