
C# How to handle large lists of data quickly

Tags: c#

I have tried to find some articles on my problem, but haven't found anything that is relevant or that makes sense for my application. Here is my issue:

I have two lists, each containing more than 20,000 items.

I need to check each item in each list against every item in the opposite list.

An implementation of something like this:

    foreach(var item1 in List1)
    {
         foreach(var item2 in List2)
         {
              // Check item 1 against item 2. 
              // Check item 2 against item 1.
         }
    }

is extremely slow and unusable because of the work done for the checks.

Is there a more efficient way to handle these large lists of items that need a check like this?

Please let me know if there is more information I can provide. Thanks for any help/suggestions.

I am using C# .NET 3.5

EDIT: Let me try to explain the checks briefly.

item1 and item2 are part of a pathing system and are connected by some number of other items. I am checking whether item1 is connected (has a valid path) to item2, and whether item2 is connected to item1. It cannot be assumed that if item1 -> item2, then item2 -> item1, so both checks are necessary.

The database contains the information about if and how item1 -> item2, and if/how item2 -> item1. Inside the check, a named pipe call is made to a service that does the actual path checking and returns whether item1 -> item2, etc.

Asked Jun 20 '12 by therealjohn

4 Answers

That's an O(N * M) check.

If you're just comparing for equality on some key or other, then you can get away with O(N + M) iterations, assuming a reasonable hash code and a good distribution of keys. The simplest way to do this in .NET is with a LINQ join:

var pairs = from x in List1
            join y in List2 on x.Key1 equals y.Key2
            select new { x, y }; // Or whatever

foreach (var pair in pairs)
{
    // Process each match
}

Of course, if you're not checking for equality, this doesn't help... but it's pretty much impossible to give any concrete help without more context.

Answered by Jon Skeet

Try to avoid a situation where every iteration sends a request to the database. When possible, do everything in one query outside the loop, or fetch the needed data outside the loop and then run your checks against that data.

It all depends on what the check operations are, so please describe them. In any case, if your iterations are independent, you can also parallelize your loops using PLINQ or the Task Parallel Library (a minimal sketch follows the links below):

Data Parallelism (Task Parallel Library)

How to: Write a Simple Parallel.ForEach Loop
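
Under the assumption that you later move to .NET 4 or newer (the Task Parallel Library is not available in .NET 3.5), a minimal sketch of parallelizing the outer loop might look like this; CheckConnectivity and Item are placeholders for your own check and item type:

    using System.Collections.Generic;
    using System.Threading.Tasks;

    class ParallelCheckSketch
    {
        static void RunChecks(IList<Item> list1, IList<Item> list2)
        {
            // Each pair is checked independently, so the outer loop can be
            // split across threads with Parallel.ForEach.
            Parallel.ForEach(list1, item1 =>
            {
                foreach (var item2 in list2)
                {
                    CheckConnectivity(item1, item2);
                    CheckConnectivity(item2, item1);
                }
            });
        }

        static void CheckConnectivity(Item from, Item to)
        {
            // Placeholder: ideally this checks against data fetched up front,
            // not a per-pair database or named-pipe round trip.
        }
    }

    class Item { public int Id; }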

Answered by Regfor


I would suggest converting both lists into hash tables (O(n) for each), then scanning through each list and doing an O(1) lookup in the other table to check whether it contains the current item. This is O(n) overall.

I've done something similar with lists of ~1,000,000 and it usually finishes in the ~1 second range from what I remember.
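
For a plain membership test, a minimal sketch along those lines (HashSet&lt;T&gt; is available in .NET 3.5; this assumes the items are directly comparable, e.g. by value or via a suitable Equals/GetHashCode) could be:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class HashLookupSketch
    {
        static void Main()
        {
            var list1 = new List<int> { 1, 2, 3, 4 };
            var list2 = new List<int> { 3, 4, 5, 6 };

            // Build the hash tables once: O(n) + O(m).
            var set1 = new HashSet<int>(list1);
            var set2 = new HashSet<int>(list2);

            // Each Contains call is O(1) on average, so both scans stay linear.
            var inBoth = list1.Where(x => set2.Contains(x)).ToList();
            var onlyInList2 = list2.Where(x => !set1.Contains(x)).ToList();

            Console.WriteLine("In both: " + inBoth.Count);
            Console.WriteLine("Only in list2: " + onlyInList2.Count);
        }
    }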

Answered by Andrew Long


Long loop + database queries = terrible performance.

What you should attempt to do is to run some queries first, get the data you need, and then do the N x M checks against that data.

This isn't necessarily possible, of course; it really depends on the kinds of checks you're doing.
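
As a rough illustration of that idea, under assumptions (a hypothetical GetAllConnections call that returns every known directed from -> to pair in one round trip, and items identified by an integer Id), the N x M loop then only touches memory:

    using System;
    using System.Collections.Generic;

    class PrefetchSketch
    {
        // Hypothetical: a single query/service call up front that returns every
        // known directed connection as (fromId, toId) pairs.
        static IEnumerable<KeyValuePair<int, int>> GetAllConnections()
        {
            yield return new KeyValuePair<int, int>(1, 2);
            yield return new KeyValuePair<int, int>(2, 1);
        }

        static void Main()
        {
            // Build an in-memory lookup of "fromId->toId" keys once.
            var connected = new HashSet<string>();
            foreach (var pair in GetAllConnections())
                connected.Add(pair.Key + "->" + pair.Value);

            // The N x M checks are now O(1) lookups with no per-pair
            // database or named-pipe round trips.
            bool oneToTwo = connected.Contains("1->2");
            bool twoToOne = connected.Contains("2->1");
            Console.WriteLine(oneToTwo + " / " + twoToOne);
        }
    }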

Answered by Roman Starkov