Fixing bad XML file (eg. unescaped & etc.) [duplicate]

Tags:

.net

xml

xmltextreader

I received an XML file from a third party that I must import into my app. Some elements contain an unescaped & in their inner text, and the provider doesn't want to fix it! So my question is: what is the best way to deal with this problem?

The file is pretty big, so the fix has to be fast. My first solution was simply to replace every & character with &amp;, but I really don't like this "solution", for obvious reasons. I don't know how to use XmlStringReader with such XML, because it throws an exception on those lines, so I can't apply HtmlEncode to the inner text. I tried setting Settings.CheckCharacters to false on the XmlTextReader, but it made no difference.

Here is a sample. The & is in the naziv element, and that field can contain anything that may appear in a company name, so my replace fix may not work for some other company name. I would like to use HtmlEncode somehow, but only on the inner text, of course.

<komitent ID="001398">
  <sifra>001398</sifra>
  <redni_broj>001398</redni_broj>
  <naziv>LJUBICA & ŽARKO</naziv>
  <adresa1>Odvrtnica 27</adresa1>
  <adresa2></adresa2>
  <drzava>HRVATSKA</drzava>
  <grad>Zagreb</grad>
</komitent>

asked May 16 '11 14:05

Antonio Bakula

2 Answers

The key message below is that unless you know the exact format of the input file, and have guarantees that any deviation from XML is consistent, you can't fix it programmatically without risking that your fixes will be incorrect.

Fixing it by replacing & with &amp; is an acceptable solution if and only if:

1. There is no acceptable well-formed source of this data.
   - As @Darin Dimitrov comments, try to find a better provider, or get this provider to fix it.
   - JSON (for example) is preferable to poorly formed XML, even if you aren't using JavaScript.
2. This is a one-off (or at least extremely infrequent) import.
   - If you have to fetch this at runtime, then this solution will not work.
3. You can keep iterating through, devising new fixes for it, adding a solution to each problem as you come across it.
   - You will probably find that once you have "fixed" it by escaping & characters, there will be other errors.
4. You have the resources to manually check the integrity of the "fixed" data.
   - The errors you "fix" may be more subtle than you realise.
5. There are no correctly formatted entities in the document.
   - Simply replacing & with &amp; will erroneously change &quot; to &amp;quot;. You may be able to get around this, but don't be naive about how tricky it might be (entities may be defined in a DTD, may refer to a Unicode code point ...)
   - If it is a particular element that misbehaves, you could consider wrapping the content of the element in <![CDATA[ and ]]>, but that still relies on you being able to find the start and end tags reliably.
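
A minimal sketch of the work-around hinted at in the last point: escape only those ampersands that do not already begin an entity or character reference. Python is used here only to keep the sketch short; the pattern relies on a negative lookahead, which .NET's Regex class also supports. The names BARE_AMPERSAND and escape_bare_ampersands are made up for this illustration, and the entity-name part of the pattern is a simplification of the real XML name rules.

```python
import re
import xml.etree.ElementTree as ET

# An "&" that does NOT start something shaped like a reference:
# a named entity (&name;), a decimal (&#38;) or a hex (&#x26;) character reference.
# Simplification: real XML names may also contain ':' and non-ASCII letters.
BARE_AMPERSAND = re.compile(
    r"&(?![A-Za-z_][A-Za-z0-9._-]*;|#[0-9]+;|#x[0-9A-Fa-f]+;)"
)

def escape_bare_ampersands(text: str) -> str:
    """Replace stray ampersands with &amp;, leaving existing references alone."""
    return BARE_AMPERSAND.sub("&amp;", text)

broken = (
    '<komitent ID="001398">'
    "<naziv>LJUBICA & ŽARKO</naziv>"
    "<grad>A &amp; B</grad>"
    "</komitent>"
)

fixed = escape_bare_ampersands(broken)
root = ET.fromstring(fixed)  # parses now that the stray "&" is escaped
print(root.findtext("naziv"))
```

This still guesses. A company name such as "AT&T;" looks like an entity reference and is left alone, so the parser will reject it, and text inside an existing CDATA section would be altered. The caveats in the list above apply in full.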

answered Oct 23 '22 16:10

Paul Butcher

Start by changing your mindset. The input is not XML, so don't call it XML. Don't even use "xml" to tag your questions about it. The fact that it isn't XML means that you can't use any XML tools with it, and you can't get any of the benefits of XML data interchange. You're dealing with a proprietary format that comes without a specification and without any tools. Treat it as you would any other proprietary format - try to discover a specification for what you are getting, and write a parser for it.
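
As a rough illustration of "write a parser for it", the sketch below treats the sample as a line-oriented proprietary format instead of XML. Everything about that format is an assumption inferred from the single sample in the question: one komitent start line, one `<field>raw text</field>` per line, then a closing line, with field text taken literally. The names parse_records and to_xml are hypothetical. Python is used for brevity.

```python
import re
import xml.etree.ElementTree as ET

# Assumed record layout (not a vendor specification).
RECORD_START = re.compile(r'<komitent ID="([^"]*)">')
FIELD = re.compile(r"<(\w+)>(.*)</\1>")

def parse_records(lines):
    """Yield (id, {field: raw text}) pairs; raw text is NOT interpreted as XML."""
    current_id, fields = None, {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        start = RECORD_START.fullmatch(line)
        if start:
            current_id, fields = start.group(1), {}
        elif line == "</komitent>" and current_id is not None:
            yield current_id, fields
            current_id = None
        else:
            field = FIELD.fullmatch(line) if current_id is not None else None
            if field is None:
                # Fail loudly instead of guessing at unknown deviations.
                raise ValueError(f"line does not fit the assumed format: {line!r}")
            fields[field.group(1)] = field.group(2)

def to_xml(record_id, fields):
    """Re-serialize one record as real XML; the serializer does the escaping."""
    elem = ET.Element("komitent", ID=record_id)
    for name, value in fields.items():
        ET.SubElement(elem, name).text = value
    return ET.tostring(elem, encoding="unicode")

sample = """<komitent ID="001398">
  <sifra>001398</sifra>
  <naziv>LJUBICA & ŽARKO</naziv>
  <adresa2></adresa2>
</komitent>"""

records = list(parse_records(sample.splitlines()))
well_formed = to_xml(*records[0])
```

Raising on any line that does not fit is deliberate: an unexpected deviation stops the import and does not get silently "repaired". If the provider escapes some values correctly and others not, taking the text literally would double-escape the correct ones, which is another reason to pin down the real format first.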

answered Oct 23 '22 18:10

Michael Kay
