 

Quickly load 350M numbers into a double[] array in C#

I am going to store 350M pre-calculated double numbers in a binary file and load them into memory when my DLL starts up. Is there any built-in way to load them in parallel, or should I split the data into multiple files myself and manage the threads myself too?

Answering the comments: I will be running this DLL on powerful enough boxes, most likely only 64-bit ones. Because all access to my numbers will be via properties anyway, I can store them in several arrays.
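For illustration, here is a minimal sketch of that split-into-several-arrays idea, assuming the file is just raw doubles in native byte order: each chunk is read from its own file offset on a separate task. The ChunkedLoader helper, file name, and chunk size below are made up for the example, not something built into .NET.

using System;
using System.IO;
using System.Threading.Tasks;

static class ChunkedLoader
{
    // Loads the file as several double[] chunks, reading the chunks in parallel.
    public static double[][] Load(string path, int doublesPerChunk)
    {
        long totalDoubles = new FileInfo(path).Length / sizeof(double);
        int chunkCount = (int)((totalDoubles + doublesPerChunk - 1) / doublesPerChunk);
        double[][] chunks = new double[chunkCount][];

        Parallel.For(0, chunkCount, i =>
        {
            long firstDouble = (long)i * doublesPerChunk;
            int count = (int)Math.Min(doublesPerChunk, totalDoubles - firstDouble);
            double[] chunk = new double[count];
            byte[] buffer = new byte[count * sizeof(double)];

            // Each task opens its own stream, so no locking is needed for the reads.
            using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
            {
                fs.Seek(firstDouble * sizeof(double), SeekOrigin.Begin);
                int read = 0;
                while (read < buffer.Length)
                {
                    read += fs.Read(buffer, read, buffer.Length - read);
                }
            }

            Buffer.BlockCopy(buffer, 0, chunk, 0, buffer.Length);
            chunks[i] = chunk;
        });

        return chunks;
    }
}

Usage would then look something like this (chunk size chosen arbitrarily, about 400 MB per chunk):

double[][] data = ChunkedLoader.Load("data.bin", 50000000);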

[update]

Everyone, thanks for answering! I'm looking forward to a lot of benchmarking on different boxes. Regarding the need: I want to speed up a very slow calculation, so I am going to pre-calculate a grid, load it into memory, and then interpolate.

asked Jul 14 '10 by A-K

2 Answers

Well, I did a small test and I would definitely recommend using memory-mapped files. I created a file containing 350M double values (2.6 GB, as many mentioned before) and then tested the time it takes to map the file to memory and to access any of the elements.

In all my tests on my laptop (Win7, .NET 4.0, Core 2 Duo 2.0 GHz, 4 GB RAM) it took less than a second to map the file, and at that point accessing any of the elements took virtually 0 ms (all the time goes into validating the index). Then I decided to go through all 350M numbers, and the whole process took about 3 minutes (paging included), so bear that in mind if in your case you have to iterate over all of them.

Nevertheless, I wrapped the access just for example purposes (there are a lot of conditions you should check before using this code), and it looks like this:


using System;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

public class Storage<T> : IDisposable, IEnumerable<T> where T : struct
{
    MemoryMappedFile mappedFile;
    MemoryMappedViewAccessor accesor;
    long elementSize;
    long numberOfElements;

    public Storage(string filePath)
    {
        if (string.IsNullOrWhiteSpace(filePath))
        {
            throw new ArgumentNullException();
        }

        if (!File.Exists(filePath))
        {
            throw new FileNotFoundException();
        }

        FileInfo info = new FileInfo(filePath);
        mappedFile = MemoryMappedFile.CreateFromFile(filePath);
        accesor = mappedFile.CreateViewAccessor(0, info.Length);
        elementSize = Marshal.SizeOf(typeof(T));
        numberOfElements = info.Length / elementSize;
    }

    public long Length
    {
        get
        {
            return numberOfElements;
        }
    }

    public T this[long index]
    {
        get
        {
            if (index < 0 || index >= numberOfElements)
            {
                throw new ArgumentOutOfRangeException();
            }

            T value = default(T);
            accesor.Read<T>(index * elementSize, out value);
            return value;
        }
    }

    public void Dispose()
    {
        if (accesor != null)
        {
            accesor.Dispose();
            accesor = null;
        }

        if (mappedFile != null)
        {
            mappedFile.Dispose();
            mappedFile = null;
        }
    }

    public IEnumerator<T> GetEnumerator()
    {
        T value;
        for (long index = 0; index < numberOfElements; index++)
        {
            value = default(T);
            accesor.Read<T>(index * elementSize, out value);
            yield return value;
        }
    }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }

    public static T[] GetArray(string filePath)
    {
        T[] elements;
        int elementSize;
        long numberOfElements;

        if (string.IsNullOrWhiteSpace(filePath))
        {
            throw new ArgumentNullException();
        }

        if (!File.Exists(filePath))
        {
            throw new FileNotFoundException();
        }

        FileInfo info = new FileInfo(filePath);
        using (MemoryMappedFile mappedFile = MemoryMappedFile.CreateFromFile(filePath))
        {
            using(MemoryMappedViewAccessor accesor = mappedFile.CreateViewAccessor(0, info.Length))
            {
                elementSize = Marshal.SizeOf(typeof(T));
                numberOfElements = info.Length / elementSize;

                if (numberOfElements > int.MaxValue)
                {
                    // A single array cannot hold this many elements; you will need to split the data across several arrays.
                    throw new NotSupportedException("Too many elements for a single array.");
                }

                elements = new T[numberOfElements];
                accesor.ReadArray<T>(0, elements, 0, (int)numberOfElements);
            }
        }

        return elements;
    }
}

Here is an example of how to use the class:


Stopwatch watch = Stopwatch.StartNew();
using (Storage<double> helper = new Storage<double>("Storage.bin"))
{
    Console.WriteLine("Initialization Time: {0}", watch.ElapsedMilliseconds);

    string item;
    long index;

    Console.Write("Item to show: ");
    while (!string.IsNullOrWhiteSpace((item = Console.ReadLine())))
    {
        if (long.TryParse(item, out index) && index >= 0 && index < helper.Length)
        {
            watch.Reset();
            watch.Start();
            double value = helper[index];
            Console.WriteLine("Access Time: {0}", watch.ElapsedMilliseconds);
            Console.WriteLine("Item: {0}", value);
        }
        else
        {
            Console.Write("Invalid index");
        }

        Console.Write("Item to show: ");
    }
}

UPDATE: I added a static method to load all the data in a file into an array. Obviously this approach takes more time initially (on my laptop it takes between 1 and 2 minutes), but after that access performance is what you expect from .NET. This method should be useful if you have to access the data frequently.

Usage is pretty simple

double[] helper = Storage<double>.GetArray("Storage.bin");

HTH

answered Oct 18 '22 by CriGoT

It sounds extremely unlikely that you'll actually be able to fit this into a contiguous array in memory, so presumably the way in which you parallelize the load depends on the actual data structure.

(Addendum: LukeH pointed out in comments that there is actually a hard 2GB limit on object size in the CLR. This is detailed in this other SO question.)
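For illustration only, here is a rough sketch of the usual workaround: keep the values in fixed-size blocks behind a small wrapper so that no single CLR object crosses the 2GB limit, while callers still use a flat long index. The ChunkedDoubleArray name and block size are made up for this example.

using System;

// Hypothetical wrapper: stores the values in fixed-size blocks so that no single
// CLR object exceeds the 2GB limit, while still exposing a flat long index.
public class ChunkedDoubleArray
{
    const int BlockSize = 1 << 24;              // 16M doubles = 128 MB per block
    readonly double[][] blocks;

    public long Length { get; private set; }

    public ChunkedDoubleArray(long length)
    {
        Length = length;
        int blockCount = (int)((length + BlockSize - 1) / BlockSize);
        blocks = new double[blockCount][];
        for (int i = 0; i < blockCount; i++)
        {
            long remaining = length - (long)i * BlockSize;
            blocks[i] = new double[Math.Min(BlockSize, remaining)];
        }
    }

    public double this[long index]
    {
        get { return blocks[index / BlockSize][index % BlockSize]; }
        set { blocks[index / BlockSize][index % BlockSize] = value; }
    }
}

Each block here is 128 MB, comfortably under the limit, at the cost of one division and one modulo per access.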

Assuming you're reading the whole thing from one disk, parallelizing the disk reads is probably a bad idea. If there's any processing you need to do to the numbers as or after you load them, you might want to consider running that in parallel at the same time you're reading from disk.
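As a sketch of that overlap (again with made-up names, and assuming the file is raw doubles), one task can read the file sequentially into buffers while Parallel.ForEach consumes and processes them, so the disk stays busy while the CPUs work:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

static class PipelinedLoad
{
    public static void Run(string path, Action<double[]> process)
    {
        // Small bounded queue so the reader cannot run far ahead of the workers.
        var queue = new BlockingCollection<double[]>(4);

        Task reader = Task.Factory.StartNew(() =>
        {
            using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
            {
                byte[] buffer = new byte[1 << 20];   // 1 MB of raw bytes per chunk
                int filled;
                while ((filled = FillBuffer(fs, buffer)) > 0)
                {
                    double[] values = new double[filled / sizeof(double)];
                    Buffer.BlockCopy(buffer, 0, values, 0, values.Length * sizeof(double));
                    queue.Add(values);
                }
            }
            queue.CompleteAdding();
        });

        // Process chunks on the thread pool while the reader keeps the disk busy.
        Parallel.ForEach(queue.GetConsumingEnumerable(), process);
        reader.Wait();
    }

    // Reads until the buffer is full or the end of the file is reached.
    static int FillBuffer(Stream stream, byte[] buffer)
    {
        int total = 0, read;
        while (total < buffer.Length &&
               (read = stream.Read(buffer, total, buffer.Length - total)) > 0)
        {
            total += read;
        }
        return total;
    }
}

A call would look something like PipelinedLoad.Run("data.bin", chunk => { /* process the doubles */ });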

answered Oct 18 '22 by mqp