Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scalable Contains method for LINQ against a SQL backend

I'm looking for an elegant way to execute a Contains() statement in a scalable way. Please allow me to give some background before I come to the actual question.

The IN statement

In Entity Framework and LINQ to SQL the Contains statement is translated as a SQL IN statement. For instance, from this statement:

var ids = Enumerable.Range(1,10);
var courses = Courses.Where(c => ids.Contains(c.CourseID)).ToList();

Entity Framework will generate

SELECT 
    [Extent1].[CourseID] AS [CourseID], 
    [Extent1].[Title] AS [Title], 
    [Extent1].[Credits] AS [Credits], 
    [Extent1].[DepartmentID] AS [DepartmentID]
    FROM [dbo].[Course] AS [Extent1]
    WHERE [Extent1].[CourseID] IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

Unfortunately, the In statement is not scalable. As per MSDN:

Including an extremely large number of values (many thousands) in an IN clause can consume resources and return errors 8623 or 8632

which has to do with running out of resources or exceeding expression limits.

But before these errors occur, the IN statement becomes increasingly slow with growing numbers of items. I can't find documentation about its growth rate, but it performs well up to a few thousands of items, but beyond that it gets dramatically slow. (Based on SQL Server experiences).

Scalable

We can't always avoid this statement. A JOIN with the source data in stead would generally perform much better, but that's only possible when the source data is in the same context. Here I'm dealing with data coming from a client in a disconnected scenario. So I have been looking for a scalable solution. A satisfactory approach turned out to be cutting the operation into chunks:

var courses = ids.ToChunks(1000)
                 .Select(chunk => Courses.Where(c => chunk.Contains(c.CourseID)))
                 .SelectMany(x => x).ToList();

(where ToChunks is this little extension method).

This executes the query in chunks of 1000 that all perform well enough. With e.g. 5000 items, 5 queries will run that together are likely to be faster than one query with 5000 items.

But not DRY

But of course I don't want to scatter this construct all over my code. I am looking for an extension method by which any IQueryable<T> can be transformed into a chunky executing statement. Ideally something like this:

var courses = Courses.Where(c => ids.Contains(c.CourseID))
              .AsChunky(1000)
              .ToList();

But maybe this

var courses = Courses.ChunkyContains(c => c.CourseID, ids, 1000)
              .ToList();

I've given the latter solution a first shot:

public static IEnumerable<TEntity> ChunkyContains<TEntity, TContains>(
    this IQueryable<TEntity> query, 
    Expression<Func<TEntity,TContains>> match, 
    IEnumerable<TContains> containList, 
    int chunkSize = 500)
{
    return containList.ToChunks(chunkSize)
               .Select (chunk => query.Where(x => chunk.Contains(match)))
               .SelectMany(x => x);
}

Obviously, the part x => chunk.Contains(match) doesn't compile. But I don't know how to manipulate the match expression into a Contains expression.

Maybe someone can help me make this solution work. And of course I'm open to other approaches to make this statement scalable.

like image 833
Gert Arnold Avatar asked Jul 02 '14 14:07

Gert Arnold


People also ask

Which method is valid in LINQ?

The LINQ join methods are supported in LINQ to Entities, with the exception of those that accept an IEqualityComparer because the comparer cannot be translated to the data source.

Is LINQ better than SQL?

The main difference between LINQ and SQL is that LINQ is a Microsoft . NET framework component, which adds native data querying capabilities to . NET languages, while SQL is a standard language to store and manage data in RDBMS.

Is LINQ to SQL deprecated?

LINQ to SQL was the first object-relational mapping technology released by Microsoft. It works well in basic scenarios and continues to be supported in Visual Studio, but it's no longer under active development.

How does a LINQ query transform to a SQL query?

LINQ to SQL translates the queries you write into equivalent SQL queries and sends them to the server for processing. More specifically, your application uses the LINQ to SQL API to request query execution. The LINQ to SQL provider then transforms the query into SQL text and delegates execution to the ADO provider.


2 Answers

I’ve solved this problem with a little different approach a view month ago. Maybe it’s a good solution for you too.

I didn’t want my solution to change the query itself. So a ids.ChunkContains(p.Id) or a special WhereContains method was unfeasible. Also should the solution be able to combine a Contains with another filter as well as using the same collection multiple times.

db.TestEntities.Where(p => (ids.Contains(p.Id) || ids.Contains(p.ParentId)) && p.Name.StartsWith("Test"))

So I tried to encapsulate the logic in a special ToList method that could rewrite the Expression for a specified collection to be queried in chunks.

var ids = Enumerable.Range(1, 11);
var result = db.TestEntities.Where(p => Ids.Contains(p.Id) && p.Name.StartsWith ("Test"))
                                .ToChunkedList(ids,4);

To rewrite the expression tree I discovered all Contains Method calls from local collections in the query with a view helping classes.

private class ContainsExpression
{
    public ContainsExpression(MethodCallExpression methodCall)
    {
        this.MethodCall = methodCall;
    }

    public MethodCallExpression MethodCall { get; private set; }

    public object GetValue()
    {
        var parent = MethodCall.Object ?? MethodCall.Arguments.FirstOrDefault();
        return Expression.Lambda<Func<object>>(parent).Compile()();
    }

    public bool IsLocalList()
    {
        Expression parent = MethodCall.Object ?? MethodCall.Arguments.FirstOrDefault();
        while (parent != null) {
            if (parent is ConstantExpression)
                return true;
            var member = parent as MemberExpression;
            if (member != null) {
                parent = member.Expression;
            } else {
                parent = null;
            }
        }
        return false;
    }
}

private class FindExpressionVisitor<T> : ExpressionVisitor where T : Expression
{
    public List<T> FoundItems { get; private set; }

    public FindExpressionVisitor()
    {
        this.FoundItems = new List<T>();
    }

    public override Expression Visit(Expression node)
    {
        var found = node as T;
        if (found != null) {
            this.FoundItems.Add(found);
        }
        return base.Visit(node);
    }
}

public static List<T> ToChunkedList<T, TValue>(this IQueryable<T> query, IEnumerable<TValue> list, int chunkSize)
{
    var finder = new FindExpressionVisitor<MethodCallExpression>();
    finder.Visit(query.Expression);
    var methodCalls = finder.FoundItems.Where(p => p.Method.Name == "Contains").Select(p => new ContainsExpression(p)).Where(p => p.IsLocalList()).ToList();
    var localLists = methodCalls.Where(p => p.GetValue() == list).ToList();

If the local collection passed in the ToChunkedList method was found in the query expression, I replace the Contains call to the original list with a new call to a temporary list containing the ids for one batch.

if (localLists.Any()) {
    var result = new List<T>();
    var valueList = new List<TValue>();

    var containsMethod = typeof(Enumerable).GetMethods(BindingFlags.Static | BindingFlags.Public)
                        .Single(p => p.Name == "Contains" && p.GetParameters().Count() == 2)
                        .MakeGenericMethod(typeof(TValue));

    var queryExpression = query.Expression;

    foreach (var item in localLists) {
        var parameter = new List<Expression>();
        parameter.Add(Expression.Constant(valueList));
        if (item.MethodCall.Object == null) {
            parameter.AddRange(item.MethodCall.Arguments.Skip(1));
        } else {
            parameter.AddRange(item.MethodCall.Arguments);
        }

        var call = Expression.Call(containsMethod, parameter.ToArray());

        var replacer = new ExpressionReplacer(item.MethodCall,call);

        queryExpression = replacer.Visit(queryExpression);
    }

    var chunkQuery = query.Provider.CreateQuery<T>(queryExpression);


    for (int i = 0; i < Math.Ceiling((decimal)list.Count() / chunkSize); i++) {
        valueList.Clear();
        valueList.AddRange(list.Skip(i * chunkSize).Take(chunkSize));

        result.AddRange(chunkQuery.ToList());
    }
    return result;
}
// if the collection was not found return query.ToList()
return query.ToList();

Expression Replacer:

private class ExpressionReplacer : ExpressionVisitor {

    private Expression find, replace;

    public ExpressionReplacer(Expression find, Expression replace)
    {
        this.find = find;
        this.replace = replace;
    }

    public override Expression Visit(Expression node)
    {
        if (node == this.find)
            return this.replace;

        return base.Visit(node);
    }
}
like image 199
codeworx Avatar answered Oct 27 '22 11:10

codeworx


Linqkit to the rescue! Might be a better way that does it directly, but this seems to work fine and makes it pretty clear what's being done. The addition being AsExpandable(), which lets you use the Invoke extension.

using LinqKit;

public static IEnumerable<TEntity> ChunkyContains<TEntity, TContains>(
    this IQueryable<TEntity> query, 
    Expression<Func<TEntity,TContains>> match, 
    IEnumerable<TContains> containList, 
    int chunkSize = 500)
{
    return containList
            .ToChunks(chunkSize)
            .Select (chunk => query.AsExpandable()
                                   .Where(x => chunk.Contains(match.Invoke(x))))
            .SelectMany(x => x);
}

You might also want to do this:

containsList.Distinct()
            .ToChunks(chunkSize)

...or something similar so you don't get duplicate results if something this occurs:

query.ChunkyContains(x => x.Id, new List<int> { 1, 1 }, 1);
like image 31
Ocelot20 Avatar answered Oct 27 '22 13:10

Ocelot20