Fixing unescaped XML entities in Java with Regex?

Question

I have some badly formatted XML that I must parse. Fixing the problem upstream is not possible.

The (current) problem is that ampersand characters are not always escaped properly, so I need to convert & into &amp;.

If &amp; is already there, I don't want to change it to &amp;amp;. In general, if any well-formed entity is already there, I don't want to destroy it. I don't think it's possible, in general, to know every entity that could appear in a particular XML document, so I want a solution that preserves anything of the form &<characters>;.

Here <characters> stands for the characters that name an entity, between the initial & and the closing ;. In particular, the < and > are placeholder notation, not literal angle brackets that would otherwise denote an XML element.

Now, when parsing, if I see &<characters> I don't know whether I'll run into a ;, a space, an end-of-line, or another &. So I think that I have to remember <characters> as I look ahead for a character that will tell me what to do with the original &.

I think that I need the power of a pushdown automaton (PDA) to do this. I don't think that a finite state machine will work, because of what I think is a memory requirement. Is that correct? If I need a PDA, then a regular expression in a call to String.replaceAll(String, String) won't work. Or is there a Java regex that can solve this problem?

Remember: there could be multiple replacements per line.

(I'm aware of this question, but it does not provide the answer that I am looking for.)

Ben Hocking · Accepted Answer

Here's the regex you're looking for: &([^;\W]*([^;\w]|$)), and the corresponding replacement string would be &amp;$1. It matches on &, followed by zero or more word characters (the class [^;\W] excludes semicolons and non-word characters, and zero is allowed so that a stand-alone ampersand matches), followed by a non-word character that is not a semicolon (or a line end). The capturing group lets the replacement put the matched text back after the &amp; that you're looking for.

Here's some sample code using it:

String s = "&amp; & &nsbp; &tc., &tc. &tc";
final String regex = "&([^;\\W]*([^;\\w]|$))"; // backslashes doubled for the Java string literal
final String replacement = "&amp;$1";
final String t = s.replaceAll(regex, replacement);

After running this in a sandbox, I get the following result for t:

&amp; &amp; &nsbp; &amp;tc., &amp;tc. &amp;tc

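As you can see, the original &amp; and &nsbp; remain unchanged. However, if you try it with "&&", you get "&amp;&", and if you try it with "&&&", you get "&amp;&&amp;", which I take as a symptom of the look-ahead problem you were alluding to: the second ampersand is consumed as the terminator of the first match, so the same pass cannot match it again. However, if you replace the line: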

final String t = s.replaceAll(regex, replacement);

with:

final String t = s.replaceAll(regex, replacement).replaceAll(regex, replacement);

It works with all of those strings and any others that I could think of. (In a finished product, you'd presumably write a single routine that would do this double replaceAll invocation.)
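A minimal sketch of such a single routine, for illustration only. The class name AmpersandFixer and the method name escapeLooseAmpersands are hypothetical (they are not from the answer above); the regex and replacement string are the ones given above, precompiled once and applied twice.

```java
import java.util.regex.Pattern;

public class AmpersandFixer {
    // Same regex as in the answer; backslashes are doubled for the Java string literal.
    private static final Pattern LOOSE_AMPERSAND =
        Pattern.compile("&([^;\\W]*([^;\\w]|$))");
    private static final String REPLACEMENT = "&amp;$1";

    // Hypothetical helper: runs the replacement twice, because the first pass
    // skips an ampersand that was consumed as the terminator of the previous match.
    public static String escapeLooseAmpersands(String s) {
        String once = LOOSE_AMPERSAND.matcher(s).replaceAll(REPLACEMENT);
        return LOOSE_AMPERSAND.matcher(once).replaceAll(REPLACEMENT);
    }

    public static void main(String[] args) {
        System.out.println(escapeLooseAmpersands("&&"));   // &amp;&amp;
        System.out.println(escapeLooseAmpersands("&&&"));  // &amp;&amp;&amp;
        System.out.println(escapeLooseAmpersands("&amp; & &nsbp; &tc., &tc. &tc"));
    }
}
```

The second pass leaves already-escaped text alone, since every & it meets is either followed by word characters and a semicolon or is one of the ampersands the first pass skipped.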

jacobq · Answer

I think you can also use a negative look-ahead to check whether an & is followed by word characters and a semicolon (e.g. &(?!\w+;)), and escape it only when it is not. Here's an example:

import java.util.Arrays;
import java.util.regex.Pattern;

public class HelloWorld {
    // Matches an & that does not start a decimal character reference (e.g. &#38;)
    // or a named entity (e.g. &amp;). Backslashes are doubled for the Java string literal.
    private static final Pattern UNESCAPED_AMPERSAND =
        Pattern.compile("&(?!(#\\d+|\\w+);)");

    public static void main(String[] args) {
        for (String s : Arrays.asList(
            "http://www.example.com/?a=1&b=2&amp;c=3/",
            "Three in a row: &amp;&&amp;",
            "&lt; is <, &gt; is >, &apos; is ', etc."
        )) {
            System.out.println(
                UNESCAPED_AMPERSAND.matcher(s).replaceAll("&amp;")
            );
        }
    }
}

// Output:
// http://www.example.com/?a=1&amp;b=2&amp;c=3/
// Three in a row: &amp;&amp;&amp;
// &lt; is <, &gt; is >, &apos; is ', etc.
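One gap worth noting: in the pattern above, #\d+ only covers decimal digits, so a hexadecimal character reference such as &#x26; would be treated as an unescaped ampersand. A possible variant is sketched below, assuming you want to preserve hexadecimal references too and to require names to start with a letter, underscore, or colon. The class name EntityAwareEscaper is hypothetical, and the name classes are an ASCII-only approximation of the XML Name production, not the full grammar.

```java
import java.util.regex.Pattern;

public class EntityAwareEscaper {
    // An & is left alone only if it starts one of:
    //   &#123;   decimal character reference
    //   &#x1F;   hexadecimal character reference
    //   &name;   ASCII approximation of an XML Name (assumption, not the full grammar)
    private static final Pattern UNESCAPED_AMPERSAND = Pattern.compile(
        "&(?!(?:#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z_:][\\w.:-]*);)");

    public static String escape(String s) {
        return UNESCAPED_AMPERSAND.matcher(s).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escape("a=1&b=2&amp;c=3"));
        // prints: a=1&amp;b=2&amp;c=3
        System.out.println(escape("&#38; &#x26; &#xZZ; &1st;"));
        // prints: &#38; &#x26; &amp;#xZZ; &amp;1st;
    }
}
```

Because the look-ahead consumes nothing, consecutive ampersands are each examined on their own, so a single replaceAll pass is enough here.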

parsifal · Answer

Start by understanding the grammar around entities: http://www.w3.org/TR/xml/#NT-EntityRef

Then look at the JavaDoc for FilterInputStream: http://download.oracle.com/javase/6/docs/api/java/io/FilterInputStream.html

Then implement one that reads the actual input character-by-character. When it sees an ampersand, it switches into "entity mode" and looks for a valid entity reference (& Name ;). If it finds one before the first character that isn't allowed in Name, then it writes it to the output verbatim. Otherwise it writes &amp; followed by everything after the ampersand.
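The scanning logic described above might be sketched as follows. To keep it short, this version works on a String rather than wrapping a stream, so the FilterInputStream plumbing is left out. The class name AmpersandScanner and its helper methods are hypothetical, and the two character tests are an ASCII-only approximation of the NameStartChar and NameChar productions in the XML grammar. Only entity references of the form & Name ; are recognized, so a character reference such as &#38; would be escaped by this sketch.

```java
public class AmpersandScanner {

    // ASCII-only approximation of XML NameStartChar (assumption).
    private static boolean isNameStart(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_' || c == ':';
    }

    // ASCII-only approximation of XML NameChar (assumption).
    private static boolean isNameChar(char c) {
        return isNameStart(c) || (c >= '0' && c <= '9') || c == '-' || c == '.';
    }

    public static String escape(String in) {
        StringBuilder out = new StringBuilder(in.length() + 16);
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c != '&') {
                out.append(c);
                i++;
                continue;
            }
            // "Entity mode": look ahead for a Name immediately followed by ';'.
            int j = i + 1;
            if (j < in.length() && isNameStart(in.charAt(j))) {
                j++;
                while (j < in.length() && isNameChar(in.charAt(j))) {
                    j++;
                }
            }
            boolean wellFormed = j > i + 1 && j < in.length() && in.charAt(j) == ';';
            if (wellFormed) {
                out.append(in, i, j + 1); // copy "&Name;" verbatim
                i = j + 1;
            } else {
                out.append("&amp;");      // escape just the ampersand
                i++;                      // and rescan what follows normally
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("&amp; & &tc., &lt;")); // &amp; &amp; &amp;tc., &lt;
        System.out.println(escape("&&&"));                // &amp;&amp;&amp;
    }
}
```

The only state carried between characters is the position of the pending ampersand, and the look-ahead stops at the first character that cannot be part of a Name.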

Tags: java, regex, xml, entities, automata

Greg Mattes
