
Fixing unescaped XML entities in Java with Regex?

I have some badly formatted XML that I must parse. Fixing the problem upstream is not possible.

The (current) problem is that ampersand characters are not always escaped properly, so I need to convert & into &amp;

If &amp; is already there, I don't want to change it to &amp;amp;. In general, if any well-formed entity is already there, I don't want to destroy it. I don't think that it's possible, in general, to know all entities that could appear in any particular XML document, so I want a solution where anything like &<characters>; is preserved.

Here <characters> stands for some set of characters that defines an entity, between the initial & and the closing ;. In particular, the < and > are only placeholder notation, not literal characters that would otherwise denote an XML element.

Now, when parsing, if I see &<characters> I don't know whether I'll run into a ;, a (space), end-of-line, or another &. So I think that I have to remember <characters> as I look ahead for a character that will tell me what to do with the original &.

I think that I need the power of a pushdown automaton to do this. I don't think that a finite state machine will work, because of what I think is a memory requirement. Is that correct? If I need a PDA, then a regular expression in a call to String.replaceAll(String, String) won't work. Or is there a Java regex that can solve this problem?

Remember: there could be multiple replacements per line.

(I'm aware of this question, but it does not provide the answer that I am looking for.)

Greg Mattes asked Jul 11 '11 18:07


3 Answers

Here's the regex you're looking for: &([^;\\W]*([^;\\w]|$)) (shown with Java string escaping), and the corresponding replacement string is &amp;$1. It matches an &, followed by zero or more word characters ([^;\\W] excludes semicolons and non-word characters; zero must be allowed so that a stand-alone ampersand matches), followed by either a non-word character that is not a semicolon or the end of the input. The capturing group holds everything after the &, so the replacement puts that text back after the new &amp;.

Here's some sample code using it:

String s = "&amp; & &nbsp; &tc., &tc. &tc";
final String regex = "&([^;\\W]*([^;\\w]|$))";
final String replacement = "&amp;$1";
final String t = s.replaceAll(regex, replacement);

After running this in a sandbox, I get the following result for t:

&amp; &amp; &nbsp; &amp;tc., &amp;tc. &amp;tc

As you can see, the original &amp; and &nbsp; remain unchanged. However, if you try it with "&&", you get &amp;&, and if you try it with "&&&", you get &amp;&&amp;. I take this as a symptom of the look-ahead problem you were alluding to: each match consumes the character that ends it, so an ampersand in that position is never examined on its own. However, if you replace the line:

final String t = s.replaceAll(regex, replacement);

with:

final String t = s.replaceAll(regex, replacement).replaceAll(regex, replacement);

It works with all of those strings and any others that I could think of. (In a finished product, you'd presumably write a single routine that would do this double replaceAll invocation.)
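
The single routine mentioned above could look something like the sketch below. The class name AmpersandFixer and the method name escapeBareAmpersands are hypothetical, chosen only for illustration; the regex and replacement string are the ones given earlier in this answer.

```java
import java.util.regex.Pattern;

public class AmpersandFixer {
    // Same regex as above, compiled once so it can be reused.
    private static final Pattern BARE_AMPERSAND =
        Pattern.compile("&([^;\\W]*([^;\\w]|$))");
    private static final String REPLACEMENT = "&amp;$1";

    // Applies the replacement twice: the first pass can skip an ampersand
    // that was consumed as the terminating character of the previous match.
    public static String escapeBareAmpersands(String s) {
        String once = BARE_AMPERSAND.matcher(s).replaceAll(REPLACEMENT);
        return BARE_AMPERSAND.matcher(once).replaceAll(REPLACEMENT);
    }

    public static void main(String[] args) {
        System.out.println(escapeBareAmpersands("&&"));   // &amp;&amp;
        System.out.println(escapeBareAmpersands("&&&"));  // &amp;&amp;&amp;
        System.out.println(escapeBareAmpersands("&amp; & &nbsp; &tc"));
        // &amp; &amp; &nbsp; &amp;tc
    }
}
```

Compiling the pattern once avoids recompiling it on every String.replaceAll call, which may matter if the routine runs once per input line.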

Ben Hocking answered Sep 28 '22 08:09


I think you can also use a negative look-ahead to check whether an & is followed by word characters and a semicolon (e.g. &(?!\w+;)), and replace it only when it is not. Here's an example:

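import java.util.Arrays;
import java.util.regex.Pattern;

public class HelloWorld {
    // Matches an & that is NOT followed by an entity name or a character
    // reference plus a closing semicolon. The hex alternative (#x...) is an
    // addition so that references such as &#x26; are also left alone.
    private static final Pattern UNESCAPED_AMPERSAND =
        Pattern.compile("&(?!(#\\d+|#x[0-9a-fA-F]+|\\w+);)");

    public static void main(String[] args) {
        for (String s : Arrays.asList(
            "http://www.example.com/?a=1&b=2&amp;c=3/",
            "Three in a row: &amp;&&amp;",
            "&lt; is <, &gt; is >, &apos; is ', etc."
        )) {
            // The look-ahead consumes nothing, so one pass is enough.
            System.out.println(
                UNESCAPED_AMPERSAND.matcher(s).replaceAll("&amp;")
            );
        }
    }
}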

// Output:
// http://www.example.com/?a=1&amp;b=2&amp;c=3/
// Three in a row: &amp;&amp;&amp;
// &lt; is <, &gt; is >, &apos; is ', etc.
jacobq answered Sep 28 '22 09:09


Start by understanding the grammar around entities: http://www.w3.org/TR/xml/#NT-EntityRef

Then look at the JavaDoc for FilterInputStream: http://download.oracle.com/javase/6/docs/api/java/io/FilterInputStream.html

Then implement one that reads the actual input character-by-character. When it sees an ampersand, it switches into "entity mode" and looks for a valid entity reference (& Name ;). If it finds one before the first character that isn't allowed in Name, then it writes it to the output verbatim. Otherwise it writes &amp; followed by everything after the ampersand.
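
The "entity mode" logic described above can be sketched roughly as follows. To keep it short, this version works on a String rather than inside a FilterInputStream, and the name checks are only a rough approximation of the Name production in the XML grammar. The class and method names (EntityScanner, escapeBareAmpersands, isNameStart, isNameChar) are hypothetical, and character references such as &#38; are not recognized here, so they would be escaped as well.

```java
public class EntityScanner {
    // Rough approximation of the first character allowed in an XML Name.
    private static boolean isNameStart(char c) {
        return Character.isLetter(c) || c == '_' || c == ':';
    }

    // Rough approximation of the remaining characters allowed in a Name.
    private static boolean isNameChar(char c) {
        return isNameStart(c) || Character.isDigit(c) || c == '-' || c == '.';
    }

    public static String escapeBareAmpersands(String in) {
        StringBuilder out = new StringBuilder(in.length() + 16);
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c != '&') {
                out.append(c);
                i++;
                continue;
            }
            // "Entity mode": look ahead for Name followed by ';'.
            int j = i + 1;
            if (j < in.length() && isNameStart(in.charAt(j))) {
                j++;
                while (j < in.length() && isNameChar(in.charAt(j))) {
                    j++;
                }
                if (j < in.length() && in.charAt(j) == ';') {
                    out.append(in, i, j + 1); // valid reference: copy verbatim
                    i = j + 1;
                    continue;
                }
            }
            // Not a reference: escape this ampersand and resume right after it,
            // so the characters that followed are scanned normally.
            out.append("&amp;");
            i++;
        }
        return out.toString();
    }
}
```

For example, escapeBareAmpersands("&amp;&&lt;") leaves the two references alone and escapes only the middle ampersand. The scan is a single left-to-right pass with one extra index for the look-ahead; no stack is involved.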

parsifal answered Sep 28 '22 08:09