When I retrieve a list of items from a database including some children (via .Include), and order the randomly, EF gives me an unexpected result.. I creates/clones addition items..
To explain myself better, I've created a small and simple EF CodeFirst project to reproduce the problem. First i shall give you the code for this project.
Create a basic MVC3 project and add the EntityFramework.SqlServerCompact package via Nuget.
That adds the latest versions of the following packages:
The Models and DbContext
using System.Collections.Generic;
using System.Data.Entity;
namespace RandomWithInclude.Models
{
public class PeopleContext : DbContext
{
public DbSet<Person> Persons { get; set; }
public DbSet<Address> Addresses { get; set; }
}
public class Person
{
public int ID { get; set; }
public string Name { get; set; }
public virtual ICollection<Address> Addresses { get; set; }
}
public class Address
{
public int ID { get; set; }
public string AdressLine { get; set; }
public virtual Person Person { get; set; }
}
}
The DB Setup and Seed data: EF.SqlServerCompact.cs
using System.Collections.Generic;
using System.Data.Entity;
using System.Data.Entity.Infrastructure;
using RandomWithInclude.Models;
[assembly: WebActivator.PreApplicationStartMethod(typeof(RandomWithInclude.App_Start.EF), "Start")]
namespace RandomWithInclude.App_Start
{
public static class EF
{
public static void Start()
{
Database.DefaultConnectionFactory = new SqlCeConnectionFactory("System.Data.SqlServerCe.4.0");
Database.SetInitializer(new DbInitializer());
}
}
public class DbInitializer : DropCreateDatabaseAlways<PeopleContext>
{
protected override void Seed(PeopleContext context)
{
var address1 = new Address {AdressLine = "Street 1, City 1"};
var address2 = new Address {AdressLine = "Street 2, City 2"};
var address3 = new Address {AdressLine = "Street 3, City 3"};
var address4 = new Address {AdressLine = "Street 4, City 4"};
var address5 = new Address {AdressLine = "Street 5, City 5"};
context.Addresses.Add(address1);
context.Addresses.Add(address2);
context.Addresses.Add(address3);
context.Addresses.Add(address4);
context.Addresses.Add(address5);
var person1 = new Person {Name = "Person 1", Addresses = new List<Address> {address1, address2}};
var person2 = new Person {Name = "Person 2", Addresses = new List<Address> {address3}};
var person3 = new Person {Name = "Person 3", Addresses = new List<Address> {address4, address5}};
context.Persons.Add(person1);
context.Persons.Add(person2);
context.Persons.Add(person3);
}
}
}
The controller: HomeController.cs
using System;
using System.Data.Entity;
using System.Linq;
using System.Web.Mvc;
using RandomWithInclude.Models;
namespace RandomWithInclude.Controllers
{
public class HomeController : Controller
{
public ActionResult Index()
{
var db = new PeopleContext();
var persons = db.Persons
.Include(p => p.Addresses)
.OrderBy(p => Guid.NewGuid());
return View(persons.ToList());
}
}
}
The View: Index.cshtml
@using RandomWithInclude.Models
@model IList<Person>
<ul>
@foreach (var person in Model)
{
<li>
@person.Name
</li>
}
</ul>
this should be all, and you application should compile :)
As you can see, we have 2 straightforward models (Person and Address) and Person can have multiple Addresses.
We seed the generated database 3 persons and 5 addresses.
If we get all the persons from the database, including the addresses and randomize the results and just print out the names of those persons, that's where it all goes wrong.
As a result, i sometimes get 4 persons, sometimes 5 and sometimes 3, and i expect 3. Always.
e.g.:
So.. it's copying/cloning data! And that's not cool..
It just seems that EF looses track of what addresses are a child of which person..
The generated SQL query is this:
SELECT
[Project1].[ID] AS [ID],
[Project1].[Name] AS [Name],
[Project1].[C2] AS [C1],
[Project1].[ID1] AS [ID1],
[Project1].[AdressLine] AS [AdressLine],
[Project1].[Person_ID] AS [Person_ID]
FROM ( SELECT
NEWID() AS [C1],
[Extent1].[ID] AS [ID],
[Extent1].[Name] AS [Name],
[Extent2].[ID] AS [ID1],
[Extent2].[AdressLine] AS [AdressLine],
[Extent2].[Person_ID] AS [Person_ID],
CASE WHEN ([Extent2].[ID] IS NULL) THEN CAST(NULL AS int) ELSE 1 END AS [C2]
FROM [People] AS [Extent1]
LEFT OUTER JOIN [Addresses] AS [Extent2] ON [Extent1].[ID] = [Extent2].[Person_ID]
) AS [Project1]
ORDER BY [Project1].[C1] ASC, [Project1].[ID] ASC, [Project1].[C2] ASC
.Include(p =>p.Addresses)
from the query, everything goes fine. but of course the addresses aren't loaded and accessing that collection will make a new call to the database every time.var persons = db.Persons.Include(p => p.Addresses).ToList().OrderBy(p => Guid.NewGuid());
Does anybody have any idea of why it is happening like this?
Might this be a bug in the SQL generation?
As one can sort it out by reading AakashM answer and Nicolae Dascalu answer, it strongly seems Linq OrderBy
requires a stable ranking function, which NewID/Guid.NewGuid
is not.
So we have to use another random generator that would be stable inside a single query.
To achieve this, before each querying, use a .Net Random generator to get a random number. Then combine this random number with a unique property of the entity to get randomly sorted. And to 'randomize' a bit the result, checksum
it. (checksum
is a SQL Server function that compute a hash; original idea founded on this blog.)
Assuming Person
Id
is an int
, you could write your query this way :
// Random instances should be stored and reused, not instanciated at each usage.
// But beware, it is not thread safe. If you want to share it between threads, you
// would have to use locks, see its documentation.
// https://docs.microsoft.com/en-us/dotnet/api/system.random.
// But using locks is a bad idea for scalability, especially in a Web context.
var randomGenerator = new Random();
// ...
var rnd = randomGenerator.NextDouble();
var persons = db.Persons
.Include(p => p.Addresses)
.OrderBy(p => SqlFunctions.Checksum(p.Id * rnd));
Like the NewGuid
hack, this is very probably not a good random generator with a good distribution and so on. But it does not cause entities to get duplicated in results.
Beware:
If your query ordering does not guarantees uniqueness of your entities ranking, you must complement it for guarantying it. By example, if you use a non-unique property of your entities for the checksum call, then add something like .ThenBy(p => p.Id)
after the OrderBy
.
If your ranking is not unique for your queried root entity, its included children may get mixed with children of other entities having the same ranking. And then the bug will stay here.
Note:
I would prefer use .Next()
method to get an int
then combine it through a xor (^
) to an entity int
unique property, rather than using a double
and multiply it. But SqlFunctions.Checksum
unfortunately does not provide an overload for int
data type, though the SQL server function is supposed to support it. You may use a cast to overcome this, but for keeping it simple I finally had chosen to go with the multiply.
tl;dr: There's a leaky abstraction here. To us, Include
is a simple instruction to stick a collection of things onto each single returned Person
row. But EF's implementation of Include
is done by returning a whole row for each Person-Address
combo, and reassembling at the client. Ordering by a volatile value causes those rows to become shuffled, breaking apart the Person
groups that EF is relying on.
When we have a look at ToTraceString()
for this LINQ:
var people = c.People.Include("Addresses");
// Note: no OrderBy in sight!
we see
SELECT
[Project1].[Id] AS [Id],
[Project1].[Name] AS [Name],
[Project1].[C1] AS [C1],
[Project1].[Id1] AS [Id1],
[Project1].[Data] AS [Data],
[Project1].[PersonId] AS [PersonId]
FROM ( SELECT
[Extent1].[Id] AS [Id],
[Extent1].[Name] AS [Name],
[Extent2].[Id] AS [Id1],
[Extent2].[PersonId] AS [PersonId],
[Extent2].[Data] AS [Data],
CASE WHEN ([Extent2].[Id] IS NULL) THEN CAST(NULL AS int) ELSE 1 END AS [C1]
FROM [Person] AS [Extent1]
LEFT OUTER JOIN [Address] AS [Extent2] ON [Extent1].[Id] = [Extent2].[PersonId]
) AS [Project1]
ORDER BY [Project1].[Id] ASC, [Project1].[C1] ASC
So we get n
rows for each A
, plus 1
row for each P
without any A
s.
Adding an OrderBy
clause, however, puts the thing-to-order-by at the start of the ordered columns:
var people = c.People.Include("Addresses").OrderBy(p => Guid.NewGuid());
gives
SELECT
[Project1].[Id] AS [Id],
[Project1].[Name] AS [Name],
[Project1].[C2] AS [C1],
[Project1].[Id1] AS [Id1],
[Project1].[Data] AS [Data],
[Project1].[PersonId] AS [PersonId]
FROM ( SELECT
NEWID() AS [C1],
[Extent1].[Id] AS [Id],
[Extent1].[Name] AS [Name],
[Extent2].[Id] AS [Id1],
[Extent2].[PersonId] AS [PersonId],
[Extent2].[Data] AS [Data],
CASE WHEN ([Extent2].[Id] IS NULL) THEN CAST(NULL AS int) ELSE 1 END AS [C2]
FROM [Person] AS [Extent1]
LEFT OUTER JOIN [Address] AS [Extent2] ON [Extent1].[Id] = [Extent2].[PersonId]
) AS [Project1]
ORDER BY [Project1].[C1] ASC, [Project1].[Id] ASC, [Project1].[C2] ASC
So in your case, where the ordered-by-thing is not a property of a P
, but is instead volatile, and therefore can be different for different P-A
records of the same P
, the whole thing falls apart.
I'm not sure where on the working-as-intended ~~~ cast-iron bug
continuum this behaviour falls. But at least now we know about it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With