 

C# Most efficient data structure to insert and to remove lower half

Imagine I have a big sorted list of integers (>1000 items). I need to be able to do two operations on this list: remove the lower half and fill the list again to its original size by inserting random integers. Because I do these operations about a million times, I need it to be as efficient as possible.

The first thing I did was just using a List that I kept sorted by adding new items at the right place. Although removing the lower half of a sorted list is very easy, inserting takes quite a bit of time.
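
Roughly, that approach boils down to something like this (just a sketch using List<int>.BinarySearch, not my exact code):

static void InsertSorted(List<int> list, int value)
{
    int index = list.BinarySearch(value);
    if (index < 0) index = ~index;   // BinarySearch returns the bitwise complement of the insertion point
    list.Insert(index, value);       // O(n): every element after 'index' is shifted one slot
}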

I tried implementing a skip list instead, but after some testing it seemed the list had to contain at least 10,000 items before it paid off; below that it actually performed worse than my plain list.

That's why I decided to use an AVL tree, so I could insert items much, much faster. But the problem is that I don't know any efficient way of deleting the lower half of such a binary search tree.

My question is: is there an efficient way to do this? Is there another data structure that I could use more easily?

Edit

As requested, I made a small test showing the difference in performance between a list, a skip list and an AVL tree. I made the skip list using this tutorial on MSDN: Skip list tutorial. The AVL tree comes from here: AVL tree. I uploaded the test to Pastebin: Program.

In the test, I add 100,000 items to each data structure while timing it. On my PC, the list took about 1 second, the skip list 0.5 seconds, and the AVL tree 0.045 seconds. If I did this a million times, as I intend to, the list would take about 11.5 days, while the AVL tree would take only about half a day. This difference clearly shows why I want it to be efficient.

asked Mar 31 '16 by Safron


3 Answers

There are a few things about this question that I'd like to point out. First, let's get some things straight about performance and C# in general, because it's hard to explain anything while misconceptions are still in the way.

Next, I'll apply all of this to the specific question at hand.

Performance in C# in general

Big-O notation

At university, you learn that O(n) is always better than O(n^2) and that O(n) is always better than O(n log n). However, the underlying assumption is that every operation costs roughly the same amount of time.

Now, when I first started programming, on an 1802 RISC processor in 1986, that was very much the case: a memory operation took 1 clock tick, and so did an add, a subtract, etc. In other words, Big-O worked fine there.

In a modern computer, it's more difficult:

  1. Data is cached at various levels (speeds range from 15 GB/s to 1000 GB/s);
  2. Operations vary between 0.5 clock tick and dozens of clock ticks;
  3. Data is usually fetched in bursts - so random access is much worse than sequential;
  4. Vectorization can process up to 8 integers of aligned data at once;
  5. Branch misprediction can throw everything off balance because you have to flush the lot.

I've observed that the difference in performance between different implementations of the same algorithm can be as much as a factor of 1000(!).

Big-O still holds merit, but you should put things into perspective. For example, say that you have N = 10,000; then log2 N ≈ 13 -- and if that 'extra' log factor lets you benefit from all of the things above, a 'stupid' O(n log n) algorithm might just outperform your average O(n) algorithm.

You should not conclude from this that an O(n^2) algorithm will ever outperform an O(n) algorithm, though; the constant factors at play aren't large enough to bridge that gap. So Big-O still has its uses; you just have to put things into perspective.

Some characteristics of C#

A myth about C# is that it's approximately as fast as C++ (which is my gold standard for "as fast as it gets"). In the hands of a skilled developer it's not, simple as that. For simple sorting, C++ is approximately 2x as fast -- but if you have more complicated scenarios where you can really benefit from the "low level stuff", the difference can become quite large. I usually estimate a factor of 10 difference in performance. However, writing proper, high-performance C++ code is challenging (to use an understatement), so you might want to stick with C# and accept the performance hit.

One thing of interest is that the C# compiler and JIT compile things pretty fast. In part, that's because they compile everything per method (so there's little room for cross-method optimizations such as aggressive inlining). Also, C# doesn't normally vectorize anything. Don't take my word for it: press Ctrl+Alt+D in Visual Studio and check the assembler output for yourself.

If we look at the list above, we can roughly state that (1), (2) and (3) aren't influenced by the fact that we're using C#; (4) is definitely influenced, and (5) depends.

As for (5), consider this simple example:

void set(int[] array, int index) 
{
    array[index] = 0;
}

Remember that in C#, code is compiled method by method. This means the compiler cannot assume that index is within bounds. In other words, it has to add two checks -- and one of them has to load the array length from memory:

if (index < 0 || index >= array.Length) 
{ 
    throw new IndexOutOfRangeException(); 
}
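
By contrast, when the loop bound is the array's own Length, the JIT can usually prove the index is always in range and drop the check entirely. A sketch of that pattern (an aside, not something measured in this answer):

void clear(int[] array)
{
    for (int i = 0; i < array.Length; ++i)
    {
        array[i] = 0;   // no per-iteration range check in the generated code
    }
}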

Sorting items

The question by the OP is about maintaining a sorted list of size m. Sorting is a well-known operation, which will cost O(log m) per item you insert at best. Since you're processing n 'random' items, you will get a best possible speed of O(n log m).

A binary heap (based on an array) might get you that performance, but I don't feel like writing out a heap right now, and I think the alternative below is about the same speed (if not faster). :)

Your question

Now that we've established what we're talking about, let's get to the question at hand. I'll explain it every step of the way.

First, while doing performance stuff, I make it a habit to remove using System.Linq so we know we're just working on the native data structures with the expected characteristics.

Let's start with a tree structure

Another simple solution would be to use a red-black tree. We have one at our disposal in .NET, called SortedSet. It uses references, arithmetic, etc. -- basically all the nasty stuff I warned about in (1), (2) and (3). Now, the implementation below isn't quite correct with respect to duplicates (a SortedSet silently drops them, unlike a list), but the speed is pretty much what you would expect:

static void Main(string[] args)
{
    Random rnd = new Random(12839);

    SortedSet<int> list = new SortedSet<int>();

    for (int i = 0; i < 5000; ++i)
    {
        list.Add(rnd.Next());
    }

    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < 10000; ++i)
    {
        for (int j = 0; j < 5000; ++j)
        {
            list.Add(rnd.Next());
        }
        int n = 0;
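        // RemoveWhere visits the set in ascending order, so this drops the 5000 smallest items.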
        list.RemoveWhere((a) => n++ < 5000);
    }

    Console.WriteLine(sw.ElapsedMilliseconds);

    Console.ReadLine();
}

Like practically all algorithms here, this algorithm executes in O(n log m).

Measurement (approximately what I would expect from AVL trees as well): 86220 ms.

Naive implementation

Normally I wouldn't have bothered with red-black trees. Still, since you put a lot of work into the AVL tree, I felt it was necessary to get this measurement.

When I'm doing performance optimization, I always start off with the easiest-to-implement algorithm that has approximately the right Big-O, and I always prefer the simplest data structure (read: an array). In this case, that's a list combined with a standard sort, which costs O(m log m) per sort, executed roughly n/m times, plus O(n) data operations. The result is O(n + n log m).

So why go for the easiest implementation, you might ask? The answer is simple: easy implementations are also easy to compile and optimize, because they usually don't have a lot of branches, don't do a lot of random memory access, and so on.

The most naive implementation (that I've already suggested in my comment) is to simply put stuff in an array, sort it, and then remove the bottom half of it.

Basically, this can be implemented like this in a minimal test case:

static void Main(string[] args)
{
    Random rnd = new Random(12839);
    List<int> list = new List<int>();

    for (int i = 0; i < 5000; ++i)
    {
        list.Add(rnd.Next());
    }

    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < 10000; ++i)
    {
        for (int j = 0; j < 5000; ++j)
        {
            list.Add(rnd.Next());
        }

        list.Sort((a, b) => a.CompareTo(b)); // #1
        list.RemoveRange(0, 5000);
    }

    Console.WriteLine(sw.ElapsedMilliseconds);
    Console.ReadLine();
}

Baseline performance: 10047 ms.

Optimization 1: remove method calls and branches

Method calls cost time. So do branches. So, if we don't need to branch, we might as well just eliminate that. In other words: this is about (5).

One thing that comes to mind is to replace #1 by:

list.Sort((a, b) => a - b);

In most (!) scenarios this gives the desired result (in general, the subtraction can overflow); since Random.Next() only returns non-negative values, this scenario is no exception.

Measurement: 8768 ms. (Yes people, that's a 15% change!)

For the fun of it, we also do a simple test for (2) with:

list.Sort((a, b) => (int)((float)a - (float)b));

The operands are exactly the same size (32 bits), it's exactly the same data, and it gives the same results -- we're just doing the comparison with a different CPU operation and adding some casts. Measurement: 10902 ms. That's a bigger difference than you would expect if every operation cost just a single clock tick.

Optimization 2: Arrays or lists?

I could also look at the list itself; we make quite a few calls to it, so we might substitute an array for it. We could even eliminate the RemoveRange by inverting the sort order. So why don't I focus on that instead? Well, actually I could, but I can tell you it won't make much of a difference, because relatively speaking it's not called that often. Still, no harm in testing, right?

static void Main(string[] args)
{
    Random rnd = new Random(12839);

    int[] list = new int[10000];

    for (int i = 0; i < 5000; ++i)
    {
        list[i] = rnd.Next();
    }

    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < 10000; ++i)
    {
        for (int j = 0; j < 5000; ++j)
        {
            list[j + 5000] = rnd.Next();
        }
        Array.Sort(list, (a, b) => b - a);
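        // Descending sort: the 5000 largest values land in [0..4999] and survive;
        // the lower half in [5000..9999] is simply overwritten in the next iteration.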
    }

    Console.WriteLine(sw.ElapsedMilliseconds);

    Console.ReadLine();
}

Now, there are two measurements here:

  • Changing the list to an array just changed it to +/- 8700 ms - which doesn't make much of a difference
  • Reversing the order changed the result to 7456 ms.

The reason this doesn't really make a difference is that the underlying data structure for a List is an array, so when we're sorting we're working on the same thing either way. And that's where our time goes.

The thing to remember here is not that arrays are just as fast as a List; in fact, I've found that they are actually faster in a lot of cases. However, in this case we're not talking about optimizations in the inner loop, we're not over-allocating much memory (everything probably stays in cache) and all memory access is aligned. All in all, we can therefore expect the difference to be quite small.

Optimization 3: remove more method calls

Now, you should notice that there's another option here: method calls cost time, and the call made most often here is to the comparer. So let's go back to the List solution and drop the custom compare operation. When we do that, we still have to do the copying (the RemoveRange). What do you expect?

Change the line to this:

list.Sort();

... and we have a new timing: 4123 ms.

Now, to be entirely fair, what we've actually done here is change our inline delegate to Comparer<int>.Default, the default implementation of the integer comparer. The delegate would have been wrapped in a comparer, creating two virtual calls per comparison -- this is just one call. That also means we could have reversed the order by implementing our own comparer class, which would have been an even faster solution.
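
As a sketch of that last idea (illustrative only, not something that was measured here), such a comparer class keeps a single virtual call per comparison and sorts in descending order, so the lower half ends up at the back of the list:

sealed class DescendingIntComparer : IComparer<int>
{
    public int Compare(int a, int b)
    {
        return b.CompareTo(a);   // reversed arguments: descending order
    }
}

// usage: list.Sort(new DescendingIntComparer());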

Optimization 4: Merge-join

Why sort everything if we only need to sort half of the data? That doesn't make sense, right?

Again, I pick the simplest possible algorithm to accomplish the task. We walk through the list in sequential order and store new items in sequential order, cf. (1) and (3). No swapping; remember that we prefer sequential data access patterns. Then we simply remove all the stuff we don't need anymore.

The algorithm we need is a merge-join, and works like this:

Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < 10000; ++i)
{
    for (int j = 0; j < 5000; ++j)
    {
        list.Add(rnd.Next());
    }

    // Sort the second half:
    list.Sort(5000, 5000, Comparer<int>.Default);
            
    // Both the lower and upper half are sorted. Merge-join:
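    // Note: as written, the items appended here (and kept after the RemoveRange below)
    // are the 5000 smallest; keeping the largest half instead is symmetric and just as fast.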
    int lhs = 0;
    int rhs = 5000;
    while (list.Count < 15000)
    {
        int l = list[lhs];
        int r = list[rhs];
        if (l < r)
        {
            list.Add(l);
            ++lhs;
        }
        else if (l > r)
        {
            list.Add(r);
            ++rhs;
        }
        else
        {
            while (list.Count < 15000 && list[lhs] == l)
            {
                list.Add(l);
                ++lhs;
            }
            while (list.Count < 15000 && list[rhs] == r)
            {
                list.Add(r);
                ++rhs;
            }
        }
    }

    list.RemoveRange(0, 10000);
}

We have a new measurement: 3563 ms.

Optimization 5: RemoveRange #2

Remember, processing data in bursts is very fast. The last piece of code is a perfect opportunity to show this in action. It uses a RemoveRange, which processes data in a burst. Alternatively, we can use two buffers and swap them around: we write into a second list2 during the merge-join and replace the RemoveRange with:

list.Clear();

var tmp = list;
list = list2;
list2 = tmp;

We now have a new timing: 3542 ms. Practically the same!

From the last two measurements you should conclude that burst operations cost so little time that you usually shouldn't even bother.

Conclusion

I started off with a tree that executed everything in 86220 ms and ended up with an algorithm that takes 3542 ms. Put bluntly, this means the last implementation runs in about 4% of the time of the first attempt.

Now, I did use different algorithms throughout this long answer - but the point was to show you how to do all these optimizations and what effect optimizations really have.

answered by atlaste


I assume you want to keep the list sorted at all times. The best way to replace the lower half with random integers is this:

  1. Remove the lower half.
  2. Add random integers to restore the original size
  3. Sort the list

Previously, you inserted each new item at the right position. That effectively implements an insertion sort, which is extremely slow here because every insert shifts O(n) elements. Instead, let the built-in, efficient sort algorithm do the heavy work.

This should be O(n * log n). Previously it was O(n^2).

You can optimize this by a constant factor by not removing the lower half at all: simply overwrite those entries with random numbers and then sort.
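
A minimal sketch of that variant (illustrative names, not the answerer's code; the list is assumed sorted ascending, so its lower half sits at the front):

static void ReplaceLowerHalf(List<int> list, Random rnd)
{
    int half = list.Count / 2;
    for (int i = 0; i < half; i++)
    {
        list[i] = rnd.Next();   // overwrite the lower half in place
    }
    list.Sort();                // one O(n log n) sort restores the order
}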

answered by usr


Why the assumption that you need a different data-structure? To say that:

The first thing I did was just using a List that I kept sorted by adding new items at the right place. Although removing the lower half of a sorted list is very easy, inserting takes quite a bit of time

This concerns me, because you may be using the correct[1] data-structure with a poor algorithm. May I strongly suggest you take a look at http://sscce.org/ and include one in your question?

But List Insertion is Slow O(n)!

Don't insert!

As explained by @usr, a far better algorithm might be something like:

  1. Remove the lower half.
  2. Add random integers to restore the original size
  3. Sort the list

No need to change the data structure, but a big change to how you solve the problem.

This is especially important because, as reiterated by @atlaste, not every system is equal, regardless of O(?):

Things just aren't that easy anymore with modern processors. I've seen cases where different implementations of the same algorithms give you a factor 100+ in difference due to branch prediction, vectorizing and cache locality. O(..) calculations are great if you're comparing apples with apples - but unfortunately that might not be the case. I wish everyone would just watch this video: youtube.com/watch?v=GPpD4BBtA1Y

But I still want an O(log n) over an O(n) data-structure!

Okay, before we finish here and move on to actually outlining the algorithm you use and measuring its performance (which seems too much to ask at present), let me ask you one question:

Let's assume we have a "big sorted list of integers (>1000 items)". In fact let's say that this list is 10,000 items long!

Which of these has better insertion performance in terms of O(?)?

A) List

B) Linked List

C) Binary Tree

When you're ready, take a look at the answer:

They all have O(1)! Big-O only tells you how well things scale (relative to themselves, and only in broad terms). Since the list has a fixed size of 10,000 items, there is no scaling; everything is treated as a 'constant factor'. Note that I'm not claiming these structures are equally performant... just that O(?) has its limits as a description. For more info: What is a plain English explanation of "Big O" notation?

Benchmark

Here's a benchmark of insert-in-sorted-position vs. sorting after adding all the new random items: http://pastebin.com/pNgx73cs

Results (Default Settings)

Testing performance of filling a list to 10000 items 1000 times, discarding 1/2
of items after every fill!

Old list: 3248ms
New list: 547ms
DONE

Note that even though we have a much more efficient approach in O(?) terms, the results are not that far apart, because at this size CPUs are surprisingly fast!

Notes:

  1. OP has a relatively small[2] collection of integers that should easily fit inside the CPU cache[3], possibly even L1. In these instances, small, contiguous and predictable memory use (e.g. an array, or a List, which in C# is backed by an array) can be very efficient.
  2. Assuming 10,000 items as an upper bound, this is only ~40 KB (a C# int is 4 bytes regardless of whether the system is 32-bit or 64-bit).
  3. Intel Skylake has 64 KiB of L1 cache and 256KiB of L2 cache per core: https://en.wikipedia.org/wiki/Skylake_(microarchitecture)

answered by NPSF3000