 

In-memory representation of large data

Tags:

c#

Currently, I am working on a project where I need to bring GBs of data onto the client machine to perform a task. The task needs the whole data set, as it does some analysis on the data and helps in the decision-making process.

So the question is: what are the best practices and suitable approaches for managing that amount of data in memory without hampering the performance of the client machine and the application?

Note: at application load time, we can spend time bringing the data from the database to the client machine; that is totally acceptable in our case. But once the data is loaded into the application at start-up, performance is very important.

asked Sep 13 '12 by Abhash786


1 Answer

This is a little hard to answer without a problem statement, i.e. what problems you are currently facing, but the following are just some thoughts, based on some recent experiences we had in a similar scenario. It is, however, a lot of work to change to this type of model - so it also depends on how much you can invest in trying to "fix" it, and I can make no promise that "your problems" are the same as "our problems", if you see what I mean. So don't get cross if the following approach doesn't work for you!


Loading that much data into memory is always going to have some impact, however, I think I see what you are doing...

When loading that much data naively, you are going to have many (millions?) of objects and a similar-or-greater number of references. You're obviously going to want to be using x64, so the references will add up - but in terms of performance the biggest problem is going to be garbage collection. You have a lot of objects that can never be collected, but the GC only knows that you're using a ton of memory, and is going to try anyway periodically. This is something I looked at in more detail in the post linked below; the graph there shows the impact - in particular, those "spikes" are all GC killing performance:

http://marcgravell.blogspot.co.uk/2011/10/assault-by-gc.html
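
If you want to confirm that GC really is the culprit in your own app (this check is mine, not part of the original answer), a crude but effective measure is to compare collection counts around your hot path; a steady stream of gen-2 collections while you're only reading already-loaded data is the tell-tale sign:

using System;

static class GcDiagnostics {
    // Runs a workload and reports how many collections of each generation happened during it.
    public static void Measure(Action workload) {
        int gen0 = GC.CollectionCount(0);
        int gen1 = GC.CollectionCount(1);
        int gen2 = GC.CollectionCount(2);

        workload();

        Console.WriteLine("Gen0: {0}, Gen1: {1}, Gen2: {2}",
            GC.CollectionCount(0) - gen0,
            GC.CollectionCount(1) - gen1,
            GC.CollectionCount(2) - gen2);
    }
}

Enabling server GC (<gcServer enabled="true"/> in app.config) can smooth the pauses somewhat, but the structural changes below attack the root cause rather than the symptom.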

For this scenario (a huge amount of data loaded, never released), we switched to using structs, i.e. loading the data into:

struct Foo {
    private readonly int id;
    private readonly double value;
    public Foo(int id, double value) {
        this.id = id;
        this.value = value;
    }
    public int Id {get{return id;}}
    public double Value {get{return value;}}
}

and stored those directly in arrays (not lists):

Foo[] foos = ...

The significance of that is that some of these structs are quite big, and we didn't want them being copied around lots of times on the stack; but with an array you can do:

private void SomeMethod(ref Foo foo) {
     if(foo.Value == ...) {blah blah blah}
}
// call ^^^
int index = 17;
SomeMethod(ref foos[index]);

Note that we've passed the item directly - it was never copied; foo.Value is actually looking directly inside the array. The tricky bit starts when you need relationships between objects. You can't store a reference here: Foo is a struct sitting inside the array, and you can't keep a reference to it around. What you can do, though, is store the index (into the array). For example:

struct Customer {
      ... more not shown
      public int FooIndex { get { return fooIndex; } }
}

Not quite as convenient as customer.Foo, but the following works nicely:

Foo foo = foos[customer.FooIndex];
// or, when passing to a method, SomeMethod(ref foos[customer.FooIndex]);
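
To make that concrete, here is a minimal self-contained sketch of the pattern; the Customer fields and the loader methods are hypothetical, since only FooIndex is shown above:

struct Customer {
    private readonly int id;
    private readonly int fooIndex; // index into the shared Foo[] rather than an object reference
    public Customer(int id, int fooIndex) {
        this.id = id;
        this.fooIndex = fooIndex;
    }
    public int Id {get{return id;}}
    public int FooIndex {get{return fooIndex;}}
}

// load phase (hypothetical loaders): populate the arrays once, then treat them as read-only
Foo[] foos = LoadFoos();
Customer[] customers = LoadCustomers();

// access phase: resolve the relationship via the index, still without copying the struct
for (int i = 0; i < customers.Length; i++) {
    SomeMethod(ref foos[customers[i].FooIndex]);
}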

Key points:

  • we're now using half the size for "references" (an int is 4 bytes; a reference on x64 is 8 bytes)
  • we don't have several-million object headers in memory
  • we don't have a huge object graph for GC to look at; only a small number of arrays that GC can look at incredibly quickly
  • but it is a little less convenient to work with, and needs some initial processing when loading

additional notes:

  • strings are a killer; if you have millions of strings, then that is problematic; at a minimum, if you have strings that are repeated, make sure you do some custom interning (not string.Intern, that would be bad) to ensure you only have one instance of each repeated value, rather than 800,000 strings with the same contents - see the first sketch after this list
  • if you have repeated data of finite length, then rather than sub-lists/arrays you might consider a fixed array (C#'s fixed-size buffers); this requires unsafe code, but avoids another myriad of objects and references - see the second sketch after this list
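
For the custom interning point, a minimal sketch (my code, not the answer's) is nothing more than a dictionary that hands back the first instance it saw of each distinct value:

using System.Collections.Generic;

class StringPool {
    private readonly Dictionary<string, string> pool = new Dictionary<string, string>();

    // Returns one shared instance per distinct string value.
    public string Intern(string value) {
        if (value == null) return null;
        string existing;
        if (pool.TryGetValue(value, out existing)) return existing;
        pool.Add(value, value);
        return value;
    }
}

Run every incoming string through Intern during the load phase; once loading finishes you can drop the pool itself, because the de-duplicated instances stay referenced from your data.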
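
For the fixed-length case, a fixed-size buffer keeps the repeated values inline in the struct instead of in a separate array object; this needs the /unsafe compiler switch, and the 24-readings-per-record layout here is just an assumed example:

unsafe struct DailyReadings {
    public int SensorId;
    public fixed double Values[24]; // stored inline; no per-record double[] object for the GC to track
}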

As an additional footnote, with that volume of data, you should think very seriously about your serialization protocols, i.e. how you're sending the data down the wire. I would strongly suggest staying far away from things like XmlSerializer, DataContractSerializer or BinaryFormatter. If you want pointers on this subject, let me know.
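
The answer deliberately doesn't name a replacement, but as one example of the kind of compact, contract-based binary serialization it is alluding to, protobuf-net (my choice of library, not something stated above) looks roughly like this; you would deserialize into simple DTOs like these and then copy into the struct arrays during the load phase:

using System.Collections.Generic;
using System.IO;
using ProtoBuf; // protobuf-net package

[ProtoContract]
public class FooDto {
    [ProtoMember(1)] public int Id { get; set; }
    [ProtoMember(2)] public double Value { get; set; }
}

static class Transfer {
    public static byte[] Pack(List<FooDto> items) {
        using (var ms = new MemoryStream()) {
            Serializer.Serialize(ms, items);
            return ms.ToArray();
        }
    }
    public static List<FooDto> Unpack(byte[] data) {
        using (var ms = new MemoryStream(data)) {
            return Serializer.Deserialize<List<FooDto>>(ms);
        }
    }
}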

answered Oct 05 '22 23:10 by Marc Gravell