Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a way to speed up this code that finds data changes in two XML files?

The following code compares two XML texts and returns a collection of data changes between them.

This code works fine but needs to be as resource-friendly as possible.

Is there a faster way to do this in LINQ, e.g. without creating the two collections of XElements and comparing each of their fields for differences?

using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

namespace TestXmlDiff8822
{
    class Program
    {
        static void Main(string[] args)
        {
            XDocument xdoc1 = XDocument.Parse(GetXml1());
            XDocument xdoc2 = XDocument.Parse(GetXml2());

            List<HistoryFieldChange> hfcList = GetHistoryFieldChanges(xdoc1, xdoc2);

            foreach (var hfc in hfcList)
            {
                Console.WriteLine("{0}: from {1} to {2}", hfc.FieldName, hfc.ValueBefore, hfc.ValueAfter);  
            }

            Console.ReadLine();
        }

        static public List<HistoryFieldChange> GetHistoryFieldChanges(XDocument xdoc1, XDocument xdoc2)
        {
            List<HistoryFieldChange> hfcList = new List<HistoryFieldChange>();

            var elements1 = from e in xdoc1.Root.Elements()
                           select e;

            var elements2 = from e in xdoc2.Root.Elements()
                           select e;

            for (int i = 0; i < elements1.Count(); i++)
            {
                XElement element1 = elements1.ElementAt(i);
                XElement element2 = elements2.ElementAt(i);

                if (element1.Value != element2.Value)
                {
                    HistoryFieldChange hfc = new HistoryFieldChange();
                    hfc.EntityName = xdoc1.Root.Name.ToString();
                    hfc.FieldName = element1.Name.ToString();
                    hfc.KindOfChange = "fieldDataChange";
                    hfc.ObjectReference = (xdoc1.Descendants("Id").FirstOrDefault()).Value;
                    hfc.ValueBefore = element1.Value;
                    hfc.ValueAfter = element2.Value;
                    hfcList.Add(hfc);
                }
            }

            return hfcList;
        }

        public static string GetXml1()
        {
            return @"
<Customer>
    <Id>111</Id>
    <FirstName>Sue</FirstName>
    <LastName>Smith</LastName>
</Customer>
";
        }



        public static string GetXml2()
        {
            return @"
<Customer>
    <Id>111</Id>
    <FirstName>Sue2</FirstName>
    <LastName>Smith-Thompson</LastName>
</Customer>
";
        }
    }

    public class HistoryFieldChange
    {
        public string EntityName { get; set; }
        public string FieldName { get; set; }
        public string ObjectReference { get; set; }
        public string KindOfChange { get; set; }
        public string ValueBefore { get; set; }
        public string ValueAfter { get; set; }
    }
}
like image 716
Edward Tanguay Avatar asked Dec 13 '22 00:12

Edward Tanguay


2 Answers

You should be able to get all the elements that have different values using a linq join on the element name.

var name = xdoc1.Root.Name.ToString();
var id = (xdoc1.Descendants("Id").FirstOrDefault()).Value;

var diff =  from o in xdoc1.Root.Elements()
        join n in xdoc2.Root.Elements() on o.Name equals n.Name
        where o.Value != n.Value
        select new HistoryFieldChange() {
                EntityName = name,
                FieldName = o.Name.ToString(),
                KindOfChange = "fieldDataChange",
                ObjectReference = id,
                ValueBefore = o.Value,
                ValueAfter = n.Value,
        };

One of the advantages to this method is that it's easy to parallelize for multicore machines, just use PLinq and the AsParallel extension method.

var diff =  from o in xdoc1.Root.Elements()
        join n in xdoc2.Root.Elements().AsParallel() on o.Name equals n.Name
        where o.Value != n.Value
        ...

Voila, if the query can be parallelized on your computer then PLinq will automatically handle it. This would speed up large documents, but if your documents are small you may get a better speedup by parallelizing the outer loop that calls GetHistoryFieldChanges using something like Parallel.For.

Another advantage is that you can simply return IEnumerable from GetHistoryFieldChanges, not need to waste time allocating a List, the items will be returned as they're enumerated, and the Linq query will not be executed until then.

IEnumerable<HistoryFieldChange> GetHistoryFieldChanges(...)

Here are times for 1M iterations of the original, Yannick's In-order, and My non-parallel Linq-only implementations. Run on my 2.8ghz laptop with this code.

Elapsed Orig    3262ms
All Linq        1761ms
In Order Only   2383ms

One interesting thing I noticed... Run the code in debug mode and then release mode, it's amazing how much the compiler can optimize the pure Linq version. I think returning IEnumerable helps the compiler a lot here.

like image 184
joshperry Avatar answered Feb 14 '23 17:02

joshperry


Edit:

I must admit that I didn't expect the LINQ implementation would be faster than the naive iterative approach with in order data. But that goes to show that the performance bottleneck isn't always in the obvious places.

As I have said, I am assuming you are comparing in order for a reason, so perhaps this implementation can still be of use. Granted it is not as legible as joshperry's implementation. But in terms of performance it should take the crown.

static public IEnumerable<HistoryFieldChange> GetHistoryFieldChanges2(XDocument xdoc1, XDocument xdoc2)
{
  string id = xdoc1.Descendants("Id").FirstOrDefault().Value;
  string name = xdoc1.Root.Name.ToString();

  IEnumerator<XElement> enumerator1 = xdoc1.Root.Elements().GetEnumerator();
  IEnumerator<XElement> enumerator2 = xdoc2.Root.Elements().GetEnumerator();

  for (; enumerator1.MoveNext() && enumerator2.MoveNext(); )
  {
    XElement element1 = enumerator1.Current;
    XElement element2 = enumerator2.Current;

    if (element1.Value != element2.Value)
    {
      HistoryFieldChange hfc = new HistoryFieldChange();
      hfc.EntityName = name;
      hfc.FieldName = element1.Name.ToString();
      hfc.KindOfChange = "fieldDataChange";
      hfc.ObjectReference = id;
      hfc.ValueBefore = element1.Value;
      hfc.ValueAfter = element2.Value;
      yield return hfc;
    }
  }
}
like image 30
Yannick Motton Avatar answered Feb 14 '23 19:02

Yannick Motton