
org.xml.sax.SAXParseException: The reference to entity "T" must end with the ';' delimiter

Tags:

java

xml

I am trying to parse an XML file which contains some special characters like "&" using a DOM parser. I am getting the SAXParseException "The reference to entity must end with the ';' delimiter". Is there any way to overcome this exception? I cannot modify the XML file to remove the special characters, since it comes from a different application. Please suggest a way to parse this XML file to get the root element.

Thanks in advance

This is the part of the XML which I am parsing:

<P>EDTA/THAM WASH 
</P>

<P>jhc ^ 72. METER SOLVENT: Meter 21 LITERS of R. O. WATER through the add line into 
FT-250. Start agitator. 
</P>

<P>R. O. WATER &lt;ZLl LITERS </P>

<P>•     NOTE: The following is a tool control operation. The area within 10 feet of any open vessel or container is under tool control. </P>

<P>-af . 73. CHARGE SOLIDS: Remove any unnecessary items from the tool controlled area. Indicate the numbers of each item that will remain in the tool controlled area during the operation in the IN box of the Tool Control Log. </P>

<P>^___y_ a. To minimize the potential for cross contamination, confirm that no other solids are being charged or packaged in adjacent equipment. </P>

<P>kk k WARNING: Wear protective gloves, air jacket and use local exhaust when handling TROMETHAMINE USP (189400) (THAM) (K-l--Irritant!). The THAM may be dusty. </P>

<P>-&lt;&amp;^b .   Charge 2.1 KG of TROMETHAMINE USP (189400) (THAM) into FT-250 through the top. </P>

<P>TROMETHAMINE USP (189400) (THAM) </P>

<P>Scale ID:     / / 7S </P>

<P>LotNo.:   qy/o^yo^ </P>

<P>Gross:    ^ . S </P>

<P>Tare: 10 ,1 </P>

<P>Net:     J^l </P>

<P>Total:   JL'J </P>

<P><Figure ActualText="&T ">

<ImageData src="images/17PT 07009K_img_1.jpg"/>
&amp;T </Figure>
Checked by </P>
vasumathi asked Dec 22 '09 05:12


4 Answers

As others have stated, your XML is definitely invalid. However, if you can't change the generating application and can add a cleaning step then the following should clean up the XML:

String clean = xml.replaceAll( "&([^;]+(?!(?:\\w|;)))", "&amp;$1" );

What that regex is doing is looking for any badly formed entity references and escaping the ampersand.

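Specifically, (?!(?:\\w|;)) is a negative look-ahead that only lets the match end where the next character is neither a word character (a-z, A-Z, 0-9, _) nor a semicolon. So the regex grabs the & plus the run of non-semicolon characters after it, ending just before a non-word, non-semicolon character. A well-formed reference such as &amp; never matches, because every possible end point inside it is followed by a word character or by the semicolon.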

It puts everything except the ampersand in the first capture group so that it can be referred to in the replace string. That's the $1.

Note that this won't fix references that look like they are valid but aren't. For example, if you had &T; that would throw a different kind of error altogether unless the XML actually defines the entity.
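
One limitation of the regex above is that [^;]+ is greedy, so when two bare ampersands appear before the next semicolon (for example "&T &U "), the second one ends up inside the capture group and is left unescaped. A possible alternative, sketched below, is to match only the ampersand itself and use the look-ahead to skip anything shaped like a reference. The class name AmpersandCleaner and the method name escapeBareAmpersands are made up for this illustration, and the name pattern is a simplification of the XML Name rule (it ignores colons and non-ASCII letters). Like the original, it works on raw text, so ampersands inside CDATA sections or comments would also be rewritten.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class AmpersandCleaner {
    // Matches an '&' that is NOT followed by something shaped like a reference:
    //   &name;   &#123;   &#x1F;
    // The name pattern is an approximation of the XML Name production.
    private static final Pattern BARE_AMP =
        Pattern.compile("&(?!(?:[A-Za-z_][\\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Escapes every bare ampersand; references that already end in ';' are untouched.
    static String escapeBareAmpersands(String xml) {
        return BARE_AMP.matcher(xml).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        String raw = "<Figure ActualText=\"&T \">&amp;T &lt; &U</Figure>";
        System.out.println(escapeBareAmpersands(raw));
    }
}
```

As with the original regex, an undefined but well-shaped reference such as &T; passes through unchanged and would still fail at parse time.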

PSpeed answered Nov 07 '22 10:11


I'm not sure I understand the question. As far as I'm aware, unless you're inside a CDATA, naked & characters without a closing ; are invalid.

If that's not the case for your XML file, then it's invalid, and you'll need to find another way of parsing it, or fixing it before SAX gets a hold of it.

If I'm misunderstanding something here, you should probably post a sample of the actual XML so we can help further.

Update:

It looks like:

Figure ActualText="&T "

is the offending line. Is this section within a CDATA or not? If not, this is not valid XML and you should not expect SAX to be able to handle it.

You'll need to either:

  • change the application that created it; or
  • fix it before it's loaded by SAX (if you can't change that application) to something like "Figure ActualText="&amp;T ""; or
  • find a non-SAX method for parsing.
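
The second option (fix the text before the parser sees it) could look roughly like the sketch below. It assumes the whole document fits in memory as a String; the class name CleanThenParse and its methods are invented for this example, and the regex is one possible way to find ampersands that do not start a well-shaped reference, not a complete implementation of the XML grammar.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of any library.
final class CleanThenParse {
    // Escape '&' unless it already starts &name; or &#123; or &#x1F;
    static String escapeBareAmpersands(String xml) {
        return xml.replaceAll(
            "&(?!(?:[A-Za-z_][\\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)", "&amp;");
    }

    // Clean the raw text, then hand it to the standard DOM parser.
    static Document parse(String rawXml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            String cleaned = escapeBareAmpersands(rawXml);
            return builder.parse(new InputSource(new StringReader(cleaned)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("XML is still not well-formed", e);
        }
    }

    public static void main(String[] args) {
        // Shortened version of the fragment from the question.
        String raw = "<P><Figure ActualText=\"&T \">&amp;T </Figure>Checked by </P>";
        Document doc = parse(raw);
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

After parsing, the attribute value comes back as the literal text "&T ", so the escaping is invisible to code that reads the DOM.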
paxdiablo answered Nov 07 '22 10:11


Some of you might be familiar with the error "The reference to entity XX must end with the ';' delimiter", which appears when adding or altering code in XML templates. I sometimes get it myself when I alter or add code in my Blogger blog's XML templates.

These errors mostly occur when we add a third-party banner or widget to an XML template. The error is easy to fix with a slight alteration to the code being added:

Just replace “&” with “&amp;” in your HTML/Javascript code!

EXAMPLE

Original Code:
<!-- Begin Code -->
<script src="http://XXXXXX.com/XXX.php?sid=XXX&br=XXX&dk=XXXXXXXXXXXX" type="text/javascript"/>
<!-- End Code -->

Altered Code:

<!-- Begin Code -->
<script src="http://XXXXXX.com/XXX.php?sid=XXX&amp;br=XXX&amp;dk=XXXXXXXXXXXX" type="text/javascript"/>
<!-- End Code -->
Ranadheer Reddy answered Nov 07 '22 10:11


Simply replace your & with &amp; and it will work.

L01c answered Nov 07 '22 10:11