
LINQ-to-objects index within a group + for different groupings (aka ROW_NUMBER with PARTITION BY equivalent)

After much Google searching and code experimentation, I'm stumped on a complex C# LINQ-to-objects problem which in SQL would be easy to solve with a pair of ROW_NUMBER()...PARTITION BY functions and a subquery or two.

Here, in words, is what I'm trying to do in code. The underlying requirement is removing duplicate documents from a list:

  1. First, group a list by (Document.Title, Document.SourceId), assuming a (simplified) class definition like this:
    class Document
    {
        string Title;
        int SourceId; // sources are prioritized (ID=1 better than ID=2)
    }
  2. Within that group, assign each document an index (e.g. Index 0 = 1st document with this title from this source, Index 1 = 2nd document with this title from this source, etc.). I'd love the equivalent of ROW_NUMBER() in SQL!

  3. Now group by (Document.Title, Index), where Index was computed in Step #2. For each group, return only one document: the one with the lowest Document.SourceId.

Step #1 is easy (e.g. codepronet.blogspot.com/2009/01/group-by-in-linq.html), but I'm getting stumped on steps #2 and #3. I can't seem to build a red-squiggle-free C# LINQ query to solve all three steps.

Anders Hejlsberg's post on this thread is, I think, the answer to Steps #2 and #3 above, if I could get the syntax right.

I'd prefer to avoid using an external local variable to do the Index computation, as recommended on slodge.blogspot.com/2009/01/adding-row-number-using-linq-to-objects.html, since that solution breaks if the external variable is modified.

Optimally, the group-by-Title step could be done first, so the "inner" groupings (first by Source to compute the index, then by Index to filter out duplicates) can operate on small numbers of objects in each "by title" group, since the number of documents in each by-title group is usually under 100. I really don't want an O(N²) solution!

I could certainly solve this with nested foreach loops, but it seems like the kind of problem which should be simple with LINQ.

Any ideas?

Justin Grant asked Jul 25 '09 22:07


4 Answers

I think jpbochi missed that you want your groupings to be by pairs of values (Title+SourceId, then Title+Index). Here's a solution written (mostly) in LINQ query syntax:

var selectedFew = 
    from doc in docs
    group doc by new { doc.Title, doc.SourceId } into g
    from docIndex in g.Select((d, i) => new { Doc = d, Index = i })
    group docIndex by new { docIndex.Doc.Title, docIndex.Index } into g
    select g.Aggregate((a,b) => (a.Doc.SourceId <= b.Doc.SourceId) ? a : b);

First we group by Title+SourceId (I use an anonymous type because the compiler builds a good hashcode for the grouping lookup). Then we use Select to attach the grouped index to the document, which we use in our second grouping. Finally, for each group we pick the lowest SourceId.

Given this input:

var docs = new[] {
    new { Title = "ABC", SourceId = 0 },
    new { Title = "ABC", SourceId = 4 },
    new { Title = "ABC", SourceId = 2 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 5 },
    new { Title = "123", SourceId = 5 },
};

I get this output:

{ Doc = { Title = ABC, SourceId = 0 }, Index = 0 }
{ Doc = { Title = 123, SourceId = 5 }, Index = 0 }
{ Doc = { Title = 123, SourceId = 5 }, Index = 1 }
{ Doc = { Title = 123, SourceId = 7 }, Index = 2 }

Update: I just saw your question about grouping by Title first. You can do this using a subquery on your Title groups:

var selectedFew =
    from doc in docs
    group doc by doc.Title into titleGroup
    from docWithIndex in
        (
            from doc in titleGroup
            group doc by doc.SourceId into idGroup
            from docIndex in idGroup.Select((d, i) => new { Doc = d, Index = i })
            group docIndex by docIndex.Index into indexGroup
            select indexGroup.Aggregate((a,b) => (a.Doc.SourceId <= b.Doc.SourceId) ? a : b)
        )
    select docWithIndex;
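
A minimal, self-contained sketch of the title-first query above, run against the same sample input as the first query. The class and method names (`TitleFirstDedup`, `Run`) and the `Title/SourceId/Index` string formatting are assumptions added for illustration; they are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TitleFirstDedup
{
    // Hypothetical wrapper: runs the title-first query on the sample input
    // and formats each surviving document as "Title/SourceId/Index".
    public static List<string> Run()
    {
        var docs = new[] {
            new { Title = "ABC", SourceId = 0 },
            new { Title = "ABC", SourceId = 4 },
            new { Title = "ABC", SourceId = 2 },
            new { Title = "123", SourceId = 7 },
            new { Title = "123", SourceId = 7 },
            new { Title = "123", SourceId = 7 },
            new { Title = "123", SourceId = 5 },
            new { Title = "123", SourceId = 5 },
        };

        var selectedFew =
            from doc in docs
            group doc by doc.Title into titleGroup
            from docWithIndex in
                (
                    // Inner query only ever sees the documents sharing one title.
                    from doc in titleGroup
                    group doc by doc.SourceId into idGroup
                    // Index = position of the document within its (Title, SourceId) group.
                    from docIndex in idGroup.Select((d, i) => new { Doc = d, Index = i })
                    group docIndex by docIndex.Index into indexGroup
                    // Keep the entry with the lowest SourceId for each index.
                    select indexGroup.Aggregate((a, b) => (a.Doc.SourceId <= b.Doc.SourceId) ? a : b)
                )
            select docWithIndex;

        return selectedFew
            .Select(x => $"{x.Doc.Title}/{x.Doc.SourceId}/{x.Index}")
            .ToList();
    }

    public static void Main()
    {
        foreach (var line in Run())
            Console.WriteLine(line);
    }
}
```

On this input the sketch yields the same four documents as the first query: `ABC/0/0`, `123/5/0`, `123/5/1`, `123/7/2`. Each inner grouping only touches the documents that share a title, which is the property the question asked for.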

dahlbyk answered Nov 10 '22 10:11


To be honest, I'm quite confused by your question. Maybe you could explain what you're trying to solve. Anyway, I'll try to answer what I understood.

1) First, I'll assume that you already have a list of documents grouped by Title+SourceId. For testing purposes, I hardcoded a list as follows:

var docs = new [] {
    new { Title = "ABC", SourceId = 0 },
    new { Title = "ABC", SourceId = 4 },
    new { Title = "ABC", SourceId = 2 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 5 },
};

2) To attach an index to every item, you can use the overload of the Select extension method whose selector function also receives the item's position. Like this:

var docsWithIndex
    = docs
    .Select( (d, i) => new { Doc = d, Index = i } );

3) From what I understood, the next step would be to group the last result by Title. Here's how to do it:

var docsGroupedByTitle
    = docsWithIndex
    .GroupBy( a => a.Doc.Title );

The GroupBy function (used above) returns an IEnumerable<IGrouping<string,DocumentWithIndex>>. Since a group is enumerable too, we now have an enumerable of enumerables.

4) Now, for each of the groups above, we'll get only the item with the minimum SourceId. This operation needs two levels of nesting. In LINQ, the outer level is a selection (for each group, get one of its items), and the inner level is an aggregation (get the item with the lowest SourceId):

var selectedFew
    = docsGroupedByTitle
    .Select(
        g => g.Aggregate(
            (a, b) => (a.Doc.SourceId  <= b.Doc.SourceId) ? a : b
        )
    );

Just to ensure that it works, I tested it with a simple foreach:

foreach (var a in selectedFew) Console.WriteLine(a);
//The result will be:
//{ Doc = { Title = ABC, SourceId = 0 }, Index = 0 }
//{ Doc = { Title = 123, SourceId = 5 }, Index = 4 }

I'm not sure that's what you wanted. If not, please comment on the answer and I can fix it. I hope this helps.

Note: All the classes used in my tests were anonymous, so you don't really need to define a DocumentWithIndex type. Actually, I haven't even declared a Document class.

jpbochi answered Nov 10 '22 10:11


Method Based Syntax:

var selectedFew = docs
    .GroupBy(doc => new { doc.Title, doc.SourceId }, doc => doc)
    .SelectMany(grouping => grouping.Select((doc, index) => new { doc, index }))
    .GroupBy(anon => new { anon.doc.Title, anon.index })
    .Select(grouping => grouping.Aggregate((a, b) => a.doc.SourceId <= b.doc.SourceId ? a : b));

Would you say the above is the equivalent Method based syntax?
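
One way to probe that question, as a sketch rather than a proof: run the query-syntax form and the method-syntax form on the sample input from the first answer and compare the results element by element. The class and method names (`SyntaxComparison`, `SameResults`) and the projection to tuples are assumptions added for illustration; the tuples are needed only because the two forms use different anonymous types.

```csharp
using System;
using System.Linq;

public static class SyntaxComparison
{
    // Hypothetical helper: returns true when both forms yield the same
    // (Title, SourceId, Index) sequence for the sample input.
    public static bool SameResults()
    {
        var docs = new[] {
            new { Title = "ABC", SourceId = 0 },
            new { Title = "ABC", SourceId = 4 },
            new { Title = "ABC", SourceId = 2 },
            new { Title = "123", SourceId = 7 },
            new { Title = "123", SourceId = 7 },
            new { Title = "123", SourceId = 7 },
            new { Title = "123", SourceId = 5 },
            new { Title = "123", SourceId = 5 },
        };

        // Query syntax, as in the first answer.
        var viaQuery =
            (from doc in docs
             group doc by new { doc.Title, doc.SourceId } into g
             from docIndex in g.Select((d, i) => new { Doc = d, Index = i })
             group docIndex by new { docIndex.Doc.Title, docIndex.Index } into g2
             select g2.Aggregate((a, b) => (a.Doc.SourceId <= b.Doc.SourceId) ? a : b))
            .Select(x => (Title: x.Doc.Title, SourceId: x.Doc.SourceId, Index: x.Index));

        // Method syntax, as proposed above.
        var viaMethods = docs
            .GroupBy(doc => new { doc.Title, doc.SourceId })
            .SelectMany(grouping => grouping.Select((doc, index) => new { doc, index }))
            .GroupBy(anon => new { anon.doc.Title, anon.index })
            .Select(grouping => grouping.Aggregate((a, b) => a.doc.SourceId <= b.doc.SourceId ? a : b))
            .Select(x => (Title: x.doc.Title, SourceId: x.doc.SourceId, Index: x.index));

        // Value tuples compare by value, so SequenceEqual checks content and order.
        return viaQuery.SequenceEqual(viaMethods);
    }

    public static void Main()
    {
        Console.WriteLine(SameResults());
    }
}
```

A `true` result here only shows that the two forms agree on this particular input; it does not by itself establish equivalence in general.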

Dog Ears answered Nov 10 '22 11:11


I implemented an extension method. It supports multiple partition-by fields as well as multiple ordering conditions.

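// Extension methods must live in a static class; the class name here is arbitrary.
public static class EnumerableExtensions
{
    public static IEnumerable<TResult> Partition<TSource, TKey, TResult>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        Func<IEnumerable<TSource>, IOrderedEnumerable<TSource>> sorter,
        Func<TSource, int, TResult> selector)
    {
        // Plain null check, standing in for the AssertUtilities.ArgumentNotNull
        // helper used in the original post (that helper is not shown).
        if (source == null) throw new ArgumentNullException("source");

        return source
            .GroupBy(keySelector)                          // PARTITION BY
            .Select(arg => sorter(arg).Select(selector))   // ORDER BY, then number the rows
            .SelectMany(arg => arg);                       // flatten the partitions
    }
}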

Usage:

var documents = new[] 
{
    new { Title = "Title1", SourceId = 1 },
    new { Title = "Title1", SourceId = 2 },
    new { Title = "Title2", SourceId = 15 },
    new { Title = "Title2", SourceId = 14 },
    new { Title = "Title3", SourceId = 100 }
};

var result = documents
    .Partition(
        arg => arg.Title,  // partition by
        arg => arg.OrderBy(x => x.SourceId), // order by
        (arg, rowNumber) => new { RowNumber = rowNumber, Document = arg }) // select
    .Where(arg => arg.RowNumber == 0)
    .Select(arg => arg.Document)
    .ToList();

Result:

{ Title = "Title1", SourceId = 1 },
{ Title = "Title2", SourceId = 14 },
{ Title = "Title3", SourceId = 100 }
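
The usage above partitions by Title only. As a sketch of how the same extension method might cover the three steps in the original question, it can be applied twice: once partitioned by (Title, SourceId) to assign the index, and once partitioned by (Title, Index) to keep the lowest SourceId. The class names (`PartitionExtensions`, `PartitionDedup`), the `Run` method, the constant-key `OrderBy` used to keep input order, and the output formatting are assumptions added for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PartitionExtensions
{
    // Same shape as the Partition method above, repeated so the sketch is self-contained.
    public static IEnumerable<TResult> Partition<TSource, TKey, TResult>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        Func<IEnumerable<TSource>, IOrderedEnumerable<TSource>> sorter,
        Func<TSource, int, TResult> selector)
    {
        if (source == null) throw new ArgumentNullException("source");
        return source.GroupBy(keySelector).SelectMany(g => sorter(g).Select(selector));
    }
}

public static class PartitionDedup
{
    public static List<string> Run()
    {
        var docs = new[] {
            new { Title = "ABC", SourceId = 0 },
            new { Title = "ABC", SourceId = 4 },
            new { Title = "ABC", SourceId = 2 },
            new { Title = "123", SourceId = 7 },
            new { Title = "123", SourceId = 7 },
            new { Title = "123", SourceId = 7 },
            new { Title = "123", SourceId = 5 },
            new { Title = "123", SourceId = 5 },
        };

        return docs
            // Steps 1 and 2: number the documents within each (Title, SourceId) partition.
            .Partition(
                d => new { d.Title, d.SourceId },
                g => g.OrderBy(d => 0), // constant key: the stable sort keeps input order
                (d, rowNumber) => new { Doc = d, Index = rowNumber })
            // Step 3: within each (Title, Index) partition, order by SourceId...
            .Partition(
                x => new { x.Doc.Title, x.Index },
                g => g.OrderBy(x => x.Doc.SourceId),
                (x, rowNumber) => new { x.Doc, x.Index, RowNumber = rowNumber })
            // ...and keep only the first row, i.e. the lowest SourceId.
            .Where(x => x.RowNumber == 0)
            .Select(x => $"{x.Doc.Title}/{x.Doc.SourceId}/{x.Index}")
            .ToList();
    }

    public static void Main()
    {
        foreach (var line in Run())
            Console.WriteLine(line);
    }
}
```

On the sample input from the first answer this yields `ABC/0/0`, `123/5/0`, `123/5/1`, `123/7/2`. The second pass sorts each group where the earlier answers use `Aggregate`; for groups of under 100 documents the difference is unlikely to matter.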

Alex Aza answered Nov 10 '22 11:11