Background
I have an application that receives periodic data dumps (XML files) and imports them into an existing database using Entity Framework 5 (Code First). The import happens via EF5 rather than, say, BULK INSERT or BCP because business rules that already exist in the entities must be applied.
Processing seems to be CPU bound in the application itself (the extremely fast, write-cache enabled disk IO subsystem shows almost zero disk wait time throughout the process, and SQL Server shows no more than 8%-10% CPU time).
To improve efficiency, I built a pipeline using TPL Dataflow with components to:
Read & Parse XML file
|
V
Create entities from XML Node
|
V
Batch entities (BatchBlock, currently n=200)
|
V
Create new DbContext / insert batched entities / ctx.SaveChanges()
I see a substantial increase in performance by doing this, but cannot get the CPU above about 60%.
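For reference, the pipeline shape described above can be sketched roughly as follows. This is a minimal sketch, not the actual import code: `ImportPipeline`, `Item`, `RunAsync`, and the `saveBatch` delegate are hypothetical names introduced for illustration, and `saveBatch` stands in for the last stage (create a new DbContext, add the batched entities, call SaveChanges()). The XML reading stage is assumed to have already produced the nodes.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;
using System.Xml.Linq;

public static class ImportPipeline
{
    // Hypothetical entity; the real entities carry the business rules.
    public sealed class Item
    {
        public string Name { get; set; }
    }

    // saveBatch is a stand-in (assumption) for the real final stage:
    //   using (var ctx = new SomeDbContext()) { add each entity; ctx.SaveChanges(); }
    // Each invocation gets its own batch, so each context is used by one thread only.
    public static async Task<int> RunAsync(
        IEnumerable<XElement> nodes,
        Action<Item[]> saveBatch,
        int batchSize = 200,
        int maxDegreeOfParallelism = 12)
    {
        int saved = 0;

        // Stage: create an entity from an XML node.
        var createEntity = new TransformBlock<XElement, Item>(
            node => new Item { Name = (string)node.Attribute("name") });

        // Stage: group entities into arrays of batchSize.
        var batch = new BatchBlock<Item>(batchSize);

        // Stage: persist each batch; several batches may be saved concurrently.
        var save = new ActionBlock<Item[]>(
            items =>
            {
                saveBatch(items);
                Interlocked.Add(ref saved, items.Length);
            },
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = maxDegreeOfParallelism
            });

        // Propagate completion so the final partial batch is flushed.
        var link = new DataflowLinkOptions { PropagateCompletion = true };
        createEntity.LinkTo(batch, link);
        batch.LinkTo(save, link);

        foreach (var node in nodes)
        {
            await createEntity.SendAsync(node);
        }

        createEntity.Complete();
        await save.Completion;
        return saved;
    }
}
```

The degree of parallelism on the last block is the setting being varied in the runs discussed here; the other blocks are left at their defaults in this sketch.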
Analysis
Suspecting some sort of resource contention, I ran the process using the VS2012 Profiler's Resource contention data (concurrency) mode.
The profiler shows me 52% contention for a resource labeled Handle 2. Drilling in, I see that the method creating the most contention for Handle 2 is
System.Data.Entity.Internal.InternalContext.SaveChanges()
Second place, at about 40% as many contentions as SaveChanges(), is
System.Data.Entity.DbSet`1.Add(!0)
Questions
UPDATE
For the run in question, the maximum degree of parallelism for the task that calls SaveChanges is set to 12 (I tried various values including Unbounded in previous runs).
UPDATE 2
Microsoft's EF team has provided feedback. See my answer for a summary.
I have researched this, and I agree that DbContext is not thread-safe. The pattern I propose does use multiple threads, but a single DbContext is only ever accessed by a single thread in a single-threaded fashion.
The following summarizes my interaction with the Entity Framework team on this issue. I'll update the answer if more information becomes available.
UPDATE
This issue is now being tracked on CodePlex:
http://entityframework.codeplex.com/workitem/636?PendingVoteId=636