Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to PLINQ an existing LINQ query with Joins?

I'm using LINQ to compare two DataSets with each other to create new rows and update existing. I've noticed that the complete comparison lasts ~1,5 hours and only one of the two cores is busy(Task-Manager is 50-52% CPU Usage). I must admit that I'm completely new to parallel LINQ, but I assume that it could increase performance significantly.

So my question is, how and what should I parallelize?

These are the original queries(reduced to the essentials):

'check for new data
Dim srcUnique = From row In src.Email_Total
                Select Ticket_ID = row.ticket_id, Interaction = row.interaction, ModifiedAt = row.modified_time

Dim destUnique = From row In dest.ContactDetail
                 Where row.ContactRow.fiContactType = emailContactType.idContactType
                 Select row.ContactRow.Ticket_ID, row.Interaction, row.ModifiedAt

'get all emails(contactdetails) that are in source but not in destination
Dim diffRows = srcUnique.Except(destUnique).ToList

'get all new emails(according to ticket_id) for calculating contact columns
Dim newRowsTickets = (From row In src.Email_Total
                     Join d In diffRows
                     On row.ticket_id Equals d.Ticket_ID _
                     And row.interaction Equals d.Interaction _
                     And row.modified_time Equals d.ModifiedAt
                     Group row By Ticket_ID = row.ticket_id Into NewTicketRows = Group).ToList

For Each ticket In newRowsTickets
     Dim contact = dest.Contact.FindByTicket_IDfiContactType(ticket.Ticket_ID, emailContactType.idContactType)
     If contact Is Nothing Then
          ' Create new Contact with many sub-queries on this ticket(omitted) ****'
          Dim newContact = Me.dest.Contact.NewContactRow
          dest.Contact.AddContactRow(newContact)
          contact = newContact
     Else
          ' Update Contact with many sub-queries on this ticket(omitted) '
     End If
     daContact.Update(dest.Contact)

     ' Add new ContactDetail-Rows from this Ticket(this is the counterpart of the src.Email_Total-Rows, details omitted) '
     For Each newRow In ticket.NewTicketRows
         Dim newContactDetail = dest.ContactDetail.NewContactDetailRow
         newContactDetail.ContactRow = contact
         dest.ContactDetail.AddContactDetailRow(newContactDetail)
     Next
     daContactDetails.Update(dest.ContactDetail)
Next

Note: daContact and daContactDetails are SqlDataAdapters, source and dest are DataSets and Contact and ContactDetail are DataTables, where every ContactDetail belongs to a Contact.

Even if not both cores would use 100% CPU, I assume that it would increase performance significantly if I would parallelize the queries, because the second core is nearly idle. The for each might also be a good place to optimize since the tickets are not related to each other. So I assume that I could loop with multiple threads and create/update records parallel. But how to do it with PLINQ?

Side Note: As I've mentioned in the comments, performance is not a key factor for me so far, since the server's only purpose is to synchronize the MySQL Database(on another server) with a MS SQL-Server(on the same server as this Windows-Service). It acts as a source for reports that are generated by another service. But these reports are only generated once a day. But apart from that I was interested in learning PLINQ because I thought that this could be an excellent exercise. It takes the mentioned 1,5h only if destination DB is empty and all records must be created. If both databases are nearly in sync, this method takes only ~1 minute yet. In future performance will become more important since email is only one of several contact-types(chat+calls will exceed 1mil.records). I think that I'll anyway need some kind of (LINQ) Data-Paging then.

If something is unclear I'll update my answer accordingly. Thanks in advance.


Edit: Here is the result of my investigations and attempts:

Question: How to "PLINQ" an existing LINQ query with joins?

Answer: Note that some LINQ operators are binary—they take two IEnumerables as input. Join is a perfect example of such an operator. In these cases, the type of the left-most data source determines whether LINQ or PLINQ is used. Thus you need only call AsParallel on the first data source for your query to run in parallel:

IEnumerable<T> leftData = ..., rightData = ...;
var q = from x in leftData.AsParallel()
        join y in rightData on x.a == y.b
        select f(x, y);

But if I change my query in the following way(note the AsParallel):

Dim newRowsTickets = (From row In src.Email_Total.AsParallel()
                                        Join d In diffRows
                                        On row.ticket_id Equals d.Ticket_ID _
                                        And row.interaction Equals d.Interaction _
                                        And row.modified_time Equals d.ModifiedAt
                                    Group row By Ticket_ID = row.ticket_id Into NewTicketRows = Group).ToList

The compiler will complain that I need to add AsParallel to the right datasource as well. So this seem to be a VB.NET issue or a lack of documentation(article is from 2007). I assume the latter because the(apart from that recommendable) article also says that you need to add System.Concurrency.dll manually but actually it is part of .NET 4.0 Framework and in Namespace Sytem.Threading.Tasks.

I realized that I won't profit from a parallelized Except since the query is fast enough in sequential mode(even with nearly the same number of rows in both collection which results in the maximum number of comparisons, I got the result in less than 30 seconds). But I will add it for the sake of completeness later.

So I decided to parallelize the for-each what is as easy as with LINQ-Queries, you simply need to add AsParallel() at the end. But I realized that I need to force the parallelism with WithExecutionMode(ParallelExecutionMode.ForceParallelism), otherwise .NET decides to use only one core for this loop. I also wanted to tell .NET that I wish to use as many Threads as possible but not more than 8: WithDegreeOfParallelism(8).

Now both cores are working at the same time, but the CPU usage stays on 54%.

So this is the PLINQ version so far:

Dim diffRows = srcUnique.AsParallel.Except(destUnique.AsParallel).ToList

Dim newRowsTickets = (From row In src.Email_Total.AsParallel()
                        Join d In diffRows.AsParallel()
                        On row.ticket_id Equals d.Ticket_ID _
                        And row.interaction Equals d.Interaction _
                        And row.modified_time Equals d.ModifiedAt
                    Group row By Ticket_ID = row.ticket_id Into NewTicketRows = Group).ToList

For Each ticket In newRowsTickets.
                    AsParallel().
                      WithDegreeOfParallelism(8).
                       WithExecutionMode(ParallelExecutionMode.ForceParallelism)
    '  blah,blah ...  '

    'add new ContactDetails for this Ticket(only new rows)
    For Each newRow In ticket.NewTicketRows.
                                AsParallel().
                                    WithExecutionMode(ParallelExecutionMode.Default)
        ' blah,blah ... '
    Next
    daContactDetails.Update(dest.ContactDetail)
Next

Unfortunately I don't see any performance benefits from using AsParallel in comparison with sequential mode:

The for each with AsParallel(hh:mm:ss.mm):

09/29/2011 18:54:36: Contacts/ContactDetails created or modified. Duration: 01:21:34.40

And without:

09/29/2011 16:02:55: Contacts/ContactDetails created or modified. Duration: 01:21:24.50

Can somebody explain me this result? Is the database' write access in the for each responsible for the similar time?


Following are recommendable readings:

  • http://msdn.microsoft.com/en-us/magazine/cc163329.aspx (not up-to-date)
    • List of changes since above article
  • Overview of videos/articles according this topic
  • PDF: When To Use ParallelForEach Or PLINQ
  • When PLINQ will speedup performance
like image 606
Tim Schmelter Avatar asked Sep 28 '11 11:09

Tim Schmelter


People also ask

Can we do Joins in LINQ?

In a LINQ query expression, join operations are performed on object collections. Object collections cannot be "joined" in exactly the same way as two relational tables. In LINQ, explicit join clauses are only required when two source sequences are not tied by any relationship.

What is AsParallel in LINQ C#?

AsParallel(IEnumerable) Enables parallelization of a query. AsParallel<TSource>(Partitioner<TSource>) Enables parallelization of a query, as sourced by a custom partitioner that is responsible for splitting the input sequence into partitions.

Can I join a table to a list using LINQ?

When the query actually runs, LINQ should be able to create a temp table or table variable with the data from the local list and then join on that. This is a feature that should absolutely be included in the framework.


1 Answers

There are 3 points worth investigating further,

  1. Do not use .toList(). I might be wrong but I think using .ToList this way would not allow the compiler to optimize the query, if further optimization was possible.
  2. Use your own filtering operation to compare data from both destionations. It might give you better performance.
  3. See if you could use LinqDataview to provide better performance.

    I dont think you will gain an advantage of PLinq while doing insertion. Look at this answer for more details.

Hope that helps. Please do ask if you need clarification on any of the above points.

like image 192
Ali Khalid Avatar answered Sep 27 '22 21:09

Ali Khalid