Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

LINQ Select Distinct while ignoring the XML field

I have a complex LINQ query (using LINQ 2 EF) that can return duplicate results and I'm thus using the .Distinct() method to avoid duplicates. Here's the skeleton:

var subQuery1 = // one query...
var subQuery2 = // another query...
var result = subQuery1.Distinct().Union( subQuery2.Distinct() ).ToArray();

Each of the sub queries join a common user table with another table and perform a 'where' query, the results are later combined in the .Union(...). This worked fine until the table was modified to include an XML column, which results in this exception:

the xml data type cannot be selected as distinct because it is not comparable

In this case I don't care if the XML column is equivalent across the results. actually I only need to be assured that the primary key UserId is distinct in the results.

Is there a way to use Distinct() but ignore the XML column or a simpler way to assure that I remove records from the result with the same UserId in an efficient way? Ideally this would not retrieve duplicate records from the database and would not require post-processing to remove the duplicates.

Update: I've found out that if I serialize my queries to arrays ahead of time then there is no need for any kind of comparer since Linq2Objects doesn't have the XML distinct selection issue. For example I can do this:

var subQuery1 = // one query...
var subQuery2 = // another query...
var result = 
   subQuery1.Distinct().ToArray().Union( 
      subQuery2.Distinct().ToArray() )
   .ToArray();

So what I'm really looking for is a way to avoid serializing the intermediate queries and do a Linq2Entities call directly that will not fetch records with duplicate UserIds. Thanks for all the answers thus far.

like image 428
TJB Avatar asked Oct 14 '10 02:10

TJB


3 Answers

Write an IEqualityComparer<T> implementation for the object which contains your XML type and pass it to Distinct. In the Equals method you can implement equality semantics however you wish.

This is a handy T4 code-generation template I wrote myself for generating IEqualityComparer<T> implementations for my team's domain models:

<#@ template language="C#v3.5" debug="True" #>
<#@ output extension=".generated.cs" #>
<#
    var modelNames = new string[] {
        "ClassName1",
        "ClassName2",
        "ClassName3",
    };

    var namespaceName = "MyNamespace";
#>
using System;
using System.Collections.Generic;

namespace <#= namespaceName #>
{
<#
    for (int i = 0; i < modelNames.Length; ++i)
    {
        string modelName = modelNames[i];
        string eqcmpClassName = modelName + "ByIDEqualityComparer";
#>
    #region <#= eqcmpClassName #>

    /// <summary>
    /// Use this EqualityComparer class to determine uniqueness among <#= modelName #> instances
    /// by using only checking the ID property.
    /// </summary>
    [System.Diagnostics.DebuggerNonUserCode]
    public sealed partial class <#= eqcmpClassName #> : IEqualityComparer<<#= modelName #>>
    {
        public bool Equals(<#= modelName #> x, <#= modelName #> y)
        {
            if ((x == null) && (y == null)) return true;
            if ((x == null) || (y == null)) return false;

            return x.ID.Equals(y.ID);
        }

        public int GetHashCode(<#= modelName #> obj)
        {
            if (obj == null) return 0;

            return obj.ID.GetHashCode();
        }
    }

    #endregion
<#
        if (i < modelNames.Length - 1) WriteLine(String.Empty);
    } // for (int i = 0; i < modelNames.Length; ++i)
#>
}

It makes the assumption that each of your model classes have a property named "ID" which is the primary key, stored as something that implements Equals. Our convention forces all of our models to have this property. If your models all have differently-named ID properties, consider modifying either this T4 template to suit your needs or better yet, make life easier on yourself (not just for the sake of using this T4) and change your models to use the "ID" name.

like image 192
James Dunne Avatar answered Nov 02 '22 01:11

James Dunne


as James Dunne said, you'd want to use an IEqualityComparer

a quick mock up would be something like this. You'll need to replace "ObjectType" with whatever type is in your subQuery1 and subQuery2 of course. please note this is untested:

List<ObjectType> listQueries = new List<ObjectType>();

ObjectTypeEqualityComparer objectTypeComparer = new ObjectTypeEqualityComparer();

listQueries.AddRange(subQuery1);// your first query
listQueries.AddRange(subQuery2); // your second query
ObjectType[] result = listQueries.Distinct(objectTypeComparer).ToArray();


class ObjectTypeEqualityComparer : IEqualityComparer<ObjectType>
{
    public bool Equals(ObjectType obj1, ObjectType obj2)
    {
        return obj1.UserId == obj2.UserId ?  true : false;
    }

    public int GetHashCode(ObjectType obj)
    {
        return obj.UserId.GetHashCode();
    }

}
like image 2
jb. Avatar answered Nov 02 '22 00:11

jb.


This extension method should return a list of items with only the first item from each set of duplicates in it...

public static IEnumerable<Tsource> RemoveDuplicates<Tkey, Tsource>(this IEnumerable<Tsource> source, Func<Tsource, Tkey> keySelector)
{
    var hashset = new HashSet<Tkey>();
    foreach (var item in source)
    {
        var key = keySelector(item);
        if (hashset.Add(key))
            yield return item;
    }
}

it would be used on a list like this list.RemoveDuplicates(x => x.UserID). If there were two records in List with the same userID, it would only return the first

like image 1
TerrorAustralis Avatar answered Nov 01 '22 23:11

TerrorAustralis