Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

I can never predict XMLReader behavior. Any tips on understanding?

Tags:

c#

.net

parsing

xml

It seems every time I use an XMLReader, I end up with a bunch of trial and error trying to figure out what I'm about to read versus what I'm reading versus what I just read. I always figure it out in the end, but I still, after using it numerous times, don't seem to have a firm grasp of what an XMLReader is actually doing when I call the various functions. For example, when I call Read the first time, if it reads an element start tag, is it now at the end of the element tag, or ready to begin reading the element's attributes? Does it know the values of the attributes yet if I call GetAttribute? What will happen if I call ReadStartElement at this point? Will it finish reading the start element, or look for the next one, skipping all the attributes? What if I want to read a number of elements -- what's the best way to try to read the next element and determine what its name is. Will Read followed by IsStartElement work, or will IsStartElement be returning information about the node following the element I just read?

As you can see I really am lacking an understanding of where an XMLReader is at during the various phases of its reading and how its state is affected by various read functions. Is there some simple pattern that I've simply failed to notice?

Here's another example of the problem (taken from the responses):

string input = "<machine code=\"01\">The Terminator" +
   "<part code=\"01a\">Right Arm</part>" +
   "<part code=\"02\">Left Arm</part>" +
   "<part code=\"03\">Big Toe</part>" +
   "</machine>";

using (System.IO.StringReader sr = new System.IO.StringReader(input))
{
   using (XmlTextReader reader = new XmlTextReader(sr))
   {
      reader.WhitespaceHandling = WhitespaceHandling.None;
      reader.MoveToContent();

      while(reader.Read())
      {
         if (reader.Name.Equals("machine") && (reader.NodeType == XmlNodeType.Element))
         {
            Console.Write("Machine code {0}: ", reader.GetAttribute("code"));
            Console.WriteLine(reader.ReadElementString("machine"));
         }
         if(reader.Name.Equals("part") && (reader.NodeType == XmlNodeType.Element))
         {
            Console.Write("Part code {0}: ", reader.GetAttribute("code"));
            Console.WriteLine(reader.ReadElementString("part"));
         }
      }
   }
}

First problem, the machine node is skipped completely. MoveToContent seems to move to the content of the machine element causing it to never be parsed. Furthermore, if you skip MoveToContent, you get an error: "'Element' is an invalid XmlNodeType." trying to ReadElementString, which I can't quite explain.

Next problem is, while reading the first part element, ReadElementString seems to position the reader at the beginning of the next part element after reading. This causes the reader.Read at the beginning of the next loop to skip over the next part element jumping right to the last part element. So the final output of this code is:

Part code 01a: Right Arm

Part code 03: Big Toe

This is a prime example of the confusign behavior of XMLReader that I'm trying to understand.

like image 261
BlueMonkMN Avatar asked Jan 23 '10 23:01

BlueMonkMN


2 Answers

Here's the thing... I've written a fair amount of serialization code (including a lot of xml processing), and I find myself in exactly the same boat as you. I have a very simple piece of guidance, therefore: don't.

I'll happily use XmlWriter as a way to write xml quickly, but I'd walk over hot coals before choosing to implement IXmlSerializable another time - I'd simply write a separate DTO and map the data into that; it also means the schema (for "mex", "wsdl", etc) comes for free.

like image 135
Marc Gravell Avatar answered Sep 19 '22 01:09

Marc Gravell


My latest solution (which works for my current case) is to stick with Read(), IsStartElement(name) and GetAttribute(name) in implementing a state machine.

using (System.Xml.XmlReader xr = System.Xml.XmlTextReader.Create(stm))
{
   employeeSchedules = new Dictionary<string, EmployeeSchedule>();
   EmployeeSchedule emp = null;
   WeekSchedule sch = null;
   TimeRanges ranges = null;
   TimeRange range = null;
   while (xr.Read())
   {
      if (xr.IsStartElement("Employee"))
      {
         emp = new EmployeeSchedule();
         employeeSchedules.Add(xr.GetAttribute("Name"), emp);
      }
      else if (xr.IsStartElement("Unavailable"))
      {
         sch = new WeekSchedule();
         emp.unavailable = sch;
      }
      else if (xr.IsStartElement("Scheduled"))
      {
         sch = new WeekSchedule();
         emp.scheduled = sch;
      }
      else if (xr.IsStartElement("DaySchedule"))
      {
         ranges = new TimeRanges();
         sch.daySchedule[int.Parse(xr.GetAttribute("DayNumber"))] = ranges;
         ranges.Color = ParseColor(xr.GetAttribute("Color"));
         ranges.FillStyle = (System.Drawing.Drawing2D.HatchStyle)
            System.Enum.Parse(typeof(System.Drawing.Drawing2D.HatchStyle),
            xr.GetAttribute("Pattern"));
      }
      else if (xr.IsStartElement("TimeRange"))
      {
         range = new TimeRange(
            System.Xml.XmlConvert.ToDateTime(xr.GetAttribute("Start"),
            System.Xml.XmlDateTimeSerializationMode.Unspecified),
            new TimeSpan((long)(System.Xml.XmlConvert.ToDouble(xr.GetAttribute("Length")) * TimeSpan.TicksPerHour)));
         ranges.Add(range);
      }
   }
   xr.Close();
}

After Read, IsStartElement will return true if you just read a start element (optinally checking the name of the element read), and you can access all the attributes of that element immediately. If all you need to read is elements and attributes, this is pretty straightforward.

Edit The new example posted in the question poses some other challenges. The correct way to read that XML seems to be like this:

using (System.IO.StringReader sr = new System.IO.StringReader(input))
{
   using (XmlTextReader reader = new XmlTextReader(sr))
   {
      reader.WhitespaceHandling = WhitespaceHandling.None;

      while(reader.Read())
      {
         if (reader.Name.Equals("machine") && (reader.NodeType == XmlNodeType.Element))
         {
            Console.Write("Machine code {0}: ", reader.GetAttribute("code"));
            Console.WriteLine(reader.ReadString());
         }
         if(reader.Name.Equals("part") && (reader.NodeType == XmlNodeType.Element))
         {
            Console.Write("Part code {0}: ", reader.GetAttribute("code"));
            Console.WriteLine(reader.ReadString());
         }
      }
   }
}

You have to use ReadString instead of ReadElementString in order to avoid reading the end element and skipping into the beginning of the next element (let the following Read() skip over the end element so it doesn't skip over the next start element). Still this seems somewhat confusing and potentially unreliable, but it works for this case.

After some additional thought, my opinion is that XMLReader is just too confusing if you use any methods to read content other than the Read method. I think it's much simpler if you confine yourself to the Read method to read from the XML stream. Here's how it would work with the new example (once again, it seems IsStartElement, GetAttribute and Read are the key methods, and you end up with a state machine):

while(reader.Read())
{
   if (reader.IsStartElement("machine"))
   {
      Console.Write("Machine code {0}: ", reader.GetAttribute("code"));
   }
   if(reader.IsStartElement("part"))
   {
      Console.Write("Part code {0}: ", reader.GetAttribute("code"));
   }
   if (reader.NodeType == XmlNodeType.Text)
   {
      Console.WriteLine(reader.Value);
   }
}
like image 27
BlueMonkMN Avatar answered Sep 19 '22 01:09

BlueMonkMN