Parsing XML with &lt; and &gt;

Tags: c#, regex, xml

I'm trying to strip down some XML and extract only the value for a given field. However, the XML does not use literal less-than and greater-than signs; they are encoded as &lt; and &gt;. Taking a substring around the field name (Date in the example below) works fine.

    &lt;my:Date xmlns:my="http://schemas.microsoft.com/office/infopath/2003/myXSD/2014-07-27T23:04:34"&gt;2014-08-15&lt;/my:Date&gt;

However, I am unable to take a substring around the encoded less-than and greater-than signs. My code is as follows:

    public string processReportXML(string field, string xml)
    {
        try
        {
            // Keep everything from the first occurrence of the field name
            // up to (but not including) its last occurrence.
            string result = xml.Substring(xml.IndexOf(field));
            int resultIndex = result.LastIndexOf(field);
            if (resultIndex != -1) result = result.Substring(0, resultIndex);

            // Trim to the text between the encoded ">" and the next encoded "<".
            result = result.Substring(result.IndexOf("&gt;"));
            resultIndex = result.IndexOf("&lt;");
            if (resultIndex != -1) result = result.Substring(0, resultIndex);

            // Skip the 4 characters of the leading "&gt;".
            return field + ": " + result.Substring(4) + "\n";
        }
        catch (Exception e)
        {
            return field + " failed\n";
        }
    }

This works fine in a test project, but in my actual web service I always get an "index should be greater than 0" error. I have also tried using a regex to replace the characters, but that didn't work either:


    result = Regex.Replace(result, "&(?!(amp|apos|quot|lt|gt);)", "hidoesthiswork?");

Zoosmell asked Dec 14 '22 20:12


1 Answer

You have HTML-encoded data.

Add this at the beginning of your method for a simple solution:

    xml = HttpUtility.HtmlDecode(xml);

If you're using .NET 4.0 or later, you can also use WebUtility.HtmlDecode (in the System.Net namespace), which does not require a reference to System.Web.
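
As a minimal sketch of the decoding step on its own: the class and method names here (DecodeSketch, DecodeXml) are hypothetical and only for illustration, and the input is assumed to be entity-encoded in the same way as the sample in the question.

```csharp
using System.Net;

public static class DecodeSketch
{
    // Hypothetical helper: turns HTML entities such as &lt; and &gt;
    // back into the literal characters < and >.
    public static string DecodeXml(string encodedXml)
    {
        return WebUtility.HtmlDecode(encodedXml);
    }
}
```

Calling DecodeSketch.DecodeXml on the sample from the question should give back ordinary markup, at which point IndexOf("<") and IndexOf(">") behave as expected.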

In the long term, you should really be using an XML parser or something like LINQ to XML to access this data. Regexes are not an appropriate tool for this sort of structured data.
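
A rough sketch of what the LINQ to XML route could look like, assuming the decoded string is a single well-formed XML element like the sample in the question. The names ReportXmlSketch and GetFieldValue are hypothetical, and matching on the local name (ignoring the my: namespace prefix) is a simplification rather than a requirement.

```csharp
using System.Net;
using System.Xml.Linq;

public static class ReportXmlSketch
{
    // Hypothetical helper: decodes the entity-encoded XML, parses it, and
    // returns the text of the first element whose local name equals `field`.
    // Returns null when no such element exists.
    public static string GetFieldValue(string field, string encodedXml)
    {
        string xml = WebUtility.HtmlDecode(encodedXml);

        // Throws System.Xml.XmlException if the decoded text is not well-formed.
        XElement root = XElement.Parse(xml);

        foreach (XElement element in root.DescendantsAndSelf())
        {
            if (element.Name.LocalName == field)
            {
                return element.Value;
            }
        }
        return null;
    }
}
```

With the sample from the question, GetFieldValue("Date", xml) would be expected to return 2014-08-15, with no index arithmetic involved. If the field values themselves can contain encoded ampersands or angle brackets, decoding the whole string first may produce malformed XML, so that case would need separate handling.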

Codeman answered Feb 06 '23 02:02