Entity Framework and Parallelism

Background

I have an application that receives periodic data dumps (XML files) and imports them into an existing database using Entity Framework 5 (Code First). The import goes through EF5 rather than, say, BULK INSERT or BCP because business rules that already exist in the entities must be applied.

Processing seems to be CPU bound in the application itself (the extremely fast, write-cache enabled disk IO subsystem shows almost zero disk wait time throughout the process, and SQL Server shows no more than 8%-10% CPU time).

To improve efficiency, I built a pipeline using TPL Dataflow with components to:

Read & Parse XML file
        |
        V
Create entities from XML Node
        |
        V
Batch entities (BatchBlock, currently n=200)
        |
        V
Create new DbContext / insert batched entities / ctx.SaveChanges()
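
For readers who want something concrete, here is a minimal sketch of how a pipeline with these four stages could be wired up in TPL Dataflow. It is not the code from the question: `Item`, `ImportPipeline`, `RunAsync`, the `saveBatch` delegate and the `name` attribute are all hypothetical stand-ins, and the EF work is left to the caller so the sketch does not depend on a database. It assumes the System.Threading.Tasks.Dataflow NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;
using System.Xml.Linq;

// Hypothetical entity; stands in for the real Code First classes.
public sealed class Item
{
    public string Name { get; set; }
}

public static class ImportPipeline
{
    // saveBatch is supplied by the caller. In the setup described above it
    // would create a new DbContext, add the batched entities and call
    // SaveChanges(), so each context is only ever used by one thread.
    public static async Task RunAsync(IEnumerable<string> xmlPaths, Action<Item[]> saveBatch)
    {
        // Propagate completion so finishing the first block drains the rest.
        var link = new DataflowLinkOptions { PropagateCompletion = true };

        // Stage 1: read and parse one XML file, emitting one element per record.
        var parse = new TransformManyBlock<string, XElement>(
            path => XDocument.Load(path).Root.Elements());

        // Stage 2: create an entity from each XML node (assumed attribute "name").
        var create = new TransformBlock<XElement, Item>(
            node => new Item { Name = (string)node.Attribute("name") },
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = Environment.ProcessorCount
            });

        // Stage 3: group entities into arrays of 200; a final partial batch
        // is emitted when the block completes.
        var batch = new BatchBlock<Item>(200);

        // Stage 4: persist each batch, with up to 12 batches in flight.
        var save = new ActionBlock<Item[]>(
            saveBatch,
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 12 });

        parse.LinkTo(create, link);
        create.LinkTo(batch, link);
        batch.LinkTo(save, link);

        foreach (var path in xmlPaths)
        {
            parse.Post(path);
        }

        parse.Complete();
        await save.Completion;
    }
}
```

Keeping the save step behind a delegate means the degree of parallelism of the last block can be varied on its own, which is the knob that matters when looking for contention inside SaveChanges().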

I see a substantial increase in performance by doing this, but cannot get CPU utilization above about 60%.

Analysis

Suspecting some sort of resource contention, I ran the process using the VS2012 Profiler's Resource contention data (concurrency) mode.

The profiler shows me 52% contention for a resource labeled Handle 2. Drilling in, I see that the method creating the most contention for Handle 2 is

System.Data.Entity.Internal.InternalContext.SaveChanges()

Second place, at about 40% as many contentions as SaveChanges(), is

System.Data.Entity.DbSet`1.Add(!0)

Questions

  • How can I figure out what Handle 2 really is (e.g. part of TPL, part of EF)?
  • Does EF throttle calls to separate DbContext instances from separate threads? It seems there is a shared resource they are contending for.
  • Is there anything that I can do to improve parallelism in this case?

UPDATE

For the run in question, the maximum degree of parallelism for the task that calls SaveChanges is set to 12 (I tried various values including Unbounded in previous runs).

UPDATE 2

Microsoft's EF team has provided feedback. See my answer for a summary.

Eric J. asked Nov 01 '12 17:11


1 Answer

The following summarizes my interaction with the Entity Framework team on this issue. I'll update the answer if more information becomes available.

  • Microsoft has been able to reproduce the issue.
  • The handle contention is related to network I/O (even with SQL Server on localhost). Specifically, there is contention for the network I/O read buffer in System.Data.dll.
  • The EF team is now working with the SQL Connectivity team to better understand the issue.
  • There is as yet no guidance from Microsoft on how to minimize the impact of this contention.

UPDATE

This issue is now being tracked on CodePlex:

http://entityframework.codeplex.com/workitem/636?PendingVoteId=636

Eric J. answered Sep 29 '22 21:09