I have some badly formatted XML that I must parse. Fixing the problem upstream is not possible.
The (current) problem is that ampersand characters are not always escaped properly, so I need to convert & into &amp;. If &amp; is already there, I don't want to change it to &amp;amp;. In general, if any well-formed entity is already there, I don't want to destroy it. I don't think that it's possible, in general, to know all entities that could appear in any particular XML document, so I want a solution where anything like &<characters>; is preserved.
Here <characters> is some set of characters defining an entity between the initial & and the closing ;. In particular, the < and > in that placeholder are not literals that would otherwise denote an XML element.
Now, when parsing, if I see &<characters> I don't know whether I'll run into a ;, a space, end-of-line, or another &. So I think that I have to remember <characters> as I look ahead for a character that will tell me what to do with the original &.
I think that I need the power of a pushdown automaton (PDA) to do this. I don't think that a finite state machine will work, because of what I think is a memory requirement. Is that correct? If I need a PDA, then a regular expression in a call to String.replaceAll(String, String) won't work. Or is there a Java regex that can solve this problem?
Remember: there could be multiple replacements per line.
(I'm aware of this question, but it does not provide the answer that I am looking for.)
Here's the regex you're looking for (written as a Java string literal): &([^;\\W]*([^;\\w]|$)), and the corresponding replacement string would be &amp;$1. It matches an &, followed by zero or more word characters (zero must be allowed so that a stand-alone ampersand matches), followed by a non-word character that is not a semicolon (or the end of the input). The capturing group lets the replacement put the matched text back after the &amp; that you're looking for.
Here's some sample code using it:
String s = "&amp; &amp; &nsbp; &tc., &tc. &tc";
final String regex = "&([^;\\W]*([^;\\w]|$))";
final String replacement = "&amp;$1";
final String t = s.replaceAll(regex, replacement);
After running this in a sandbox, I get the following result for t:
&amp; &amp; &nsbp; &amp;tc., &amp;tc. &amp;tc
As you can see, the original &amp; and &nsbp; remain unchanged. However, if you try it with "&&", you get &amp;&, and if you try it with "&&&", you get &amp;&&amp;, which I take as a symptom of the look-ahead problem you were alluding to. However, if you replace the line:
final String t = s.replaceAll(regex, replacement);
with:
final String t = s.replaceAll(regex, replacement).replaceAll(regex, replacement);
it works with all of those strings and any others that I could think of. (In a finished product, you'd presumably write a single routine that would do this double replaceAll invocation.)
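The single routine mentioned above could look roughly like the following. This is only a sketch: the class name AmpersandEscaper and the method name escapeAmpersands are made up for illustration, and it assumes the replacement string is meant to be &amp;$1.

```java
import java.util.regex.Pattern;

class AmpersandEscaper {
    // Same regex as above: '&', then word characters, then either a
    // non-word character other than ';' or the end of the input.
    private static final Pattern BARE_AMP =
            Pattern.compile("&([^;\\W]*([^;\\w]|$))");

    // Hypothetical helper name; runs the replacement twice because an '&'
    // consumed as the terminator of one match cannot start the next match.
    static String escapeAmpersands(String s) {
        String once = BARE_AMP.matcher(s).replaceAll("&amp;$1");
        return BARE_AMP.matcher(once).replaceAll("&amp;$1");
    }

    public static void main(String[] args) {
        System.out.println(escapeAmpersands("&&&"));           // &amp;&amp;&amp;
        System.out.println(escapeAmpersands("a &amp; b & c")); // a &amp; b &amp; c
    }
}
```

Compiling the pattern once avoids recompiling it on every call, which String.replaceAll would otherwise do.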
I think you can also use a negative look-ahead to check whether an & is followed by word characters and a semicolon (e.g. &(?!\w+;)). Here's an example:
import java.util.*;
import java.util.regex.*;

public class HelloWorld {
    // An '&' that is NOT followed by "#digits;" or "wordchars;".
    private static final Pattern UNESCAPED_AMPERSAND =
        Pattern.compile("&(?!(#\\d+|\\w+);)");

    public static void main(String[] args) {
        for (String s : Arrays.asList(
            "http://www.example.com/?a=1&b=2&c=3/",
            "Three in a row: &&&",
            "&lt; is <, &gt; is >, &apos; is ', etc."
        )) {
            // Only the bare ampersands are replaced; existing entities are kept.
            System.out.println(
                UNESCAPED_AMPERSAND.matcher(s).replaceAll("&amp;")
            );
        }
    }
}
// Output:
// http://www.example.com/?a=1&amp;b=2&amp;c=3/
// Three in a row: &amp;&amp;&amp;
// &lt; is <, &gt; is >, &apos; is ', etc.
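One thing to watch with the pattern above: it does not recognise hexadecimal character references such as &#x26;, so it would escape the & in front of them, and \w+ also accepts names that start with a digit. A possible variation is sketched below. The class name StrictAmpersandEscaper is made up, and the name part is an ASCII-only approximation of the XML Name production, not the full grammar.

```java
import java.util.regex.Pattern;

class StrictAmpersandEscaper {
    // Assumption: ASCII-only approximation of an XML Name, plus decimal
    // (&#38;) and hexadecimal (&#x26;) character references.
    private static final Pattern UNESCAPED_AMPERSAND = Pattern.compile(
            "&(?!(?:#\\d+|#x[0-9a-fA-F]+|[A-Za-z_:][A-Za-z0-9_.:-]*);)");

    static String escape(String s) {
        // The look-ahead consumes nothing, so adjacent ampersands are all seen.
        return UNESCAPED_AMPERSAND.matcher(s).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        // Prints: &#x26; &#38; &amp; &amp;1abc; &amp; &amp;&amp;
        System.out.println(escape("&#x26; &#38; &amp; &1abc; & &&"));
    }
}
```

Whether &1abc; should be escaped depends on how strict you want to be; the stricter reading is taken here because an XML name cannot start with a digit.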
Start by understanding the grammar around entities: http://www.w3.org/TR/xml/#NT-EntityRef
Then look at the JavaDoc for FilterInputStream: http://download.oracle.com/javase/6/docs/api/java/io/FilterInputStream.html
Then implement one that reads the actual input character-by-character. When it sees an ampersand, it switches into "entity mode" and looks for a valid entity reference (& Name ;). If it finds one before the first character that isn't allowed in Name, then it writes it to the output verbatim. Otherwise it writes &amp; followed by everything after the ampersand.
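A rough sketch of that "entity mode" scan is below. To keep it short it copies from a Reader to a Writer instead of subclassing FilterInputStream, which is a swapped-in simplification. The class and method names (EntityAwareCopier, copy, escape) are made up, the Name check is an ASCII-only approximation, and character references such as &#38; are not treated specially because the description above only covers & Name ;.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

class EntityAwareCopier {
    // Assumption: ASCII-only approximation of NameStartChar / NameChar.
    private static boolean isNameStart(int c) {
        return c == ':' || c == '_'
                || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    private static boolean isNameChar(int c) {
        return isNameStart(c) || c == '-' || c == '.' || (c >= '0' && c <= '9');
    }

    static void copy(Reader in, Writer out) throws IOException {
        int c = in.read();
        while (c != -1) {
            if (c != '&') {
                out.write(c);
                c = in.read();
                continue;
            }
            // "Entity mode": buffer candidate Name characters after the '&'.
            StringBuilder name = new StringBuilder();
            c = in.read();
            if (isNameStart(c)) {
                do {
                    name.append((char) c);
                    c = in.read();
                } while (isNameChar(c));
            }
            if (name.length() > 0 && c == ';') {
                out.write("&" + name + ";"); // well-formed reference: verbatim
                c = in.read();
            } else {
                out.write("&amp;" + name);   // bare ampersand: escape it
                // c is not consumed here; the loop handles it next,
                // which matters when it is another '&'.
            }
        }
    }

    // Convenience wrapper for strings (hypothetical helper).
    static String escape(String s) {
        StringWriter out = new StringWriter();
        try {
            copy(new StringReader(s), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: a&amp;b &amp; &amp;&amp;&amp; &lt;x
        System.out.println(escape("a&b &amp; &&& &lt;x"));
    }
}
```

The only state kept while scanning is the buffered name, which is written out as soon as the character after it is known.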