Better/faster way to fill a big array in C#

I have 3 *.dat files (346KB, 725KB, 1762KB) that each contain a JSON string of "big" int arrays.

Every time my object is created (which happens several times), I read those three files and use JsonConvert.DeserializeObject to deserialize the arrays into the object.

I have thought about using binary files instead of a JSON string, or could I even save these arrays directly? I don't have to use these files; that's just where the data is currently stored, and I would gladly switch to anything faster.

What are the different ways to speed up the initialization of these objects?

Sven asked Sep 08 '11 09:09



2 Answers

The fastest way is to manually serialize the data.

An easy way to do this is by creating a FileStream, and then wrapping it in a BinaryWriter/BinaryReader.

These give you methods for reading and writing the basic data types (numbers, string, char, byte[], and char[]).

An easy way to write an int[] is to prefix it with the array's length as an int or long, depending on the size (unnecessary if the size is fixed; an unsigned type gives no real advantage, since arrays use signed data types for their length anyway), and then write all the ints.

Two ways to write all the ints would be:
1. Simply loop over the entire array.
2. Convert it into a byte[] and write it using BinaryWriter.Write(byte[])

This is how you can implement both:

// Writing
BinaryWriter writer = new BinaryWriter(new FileStream(...));
int[] intArr = new int[1000];

writer.Write(intArr.Length);
for (int i = 0; i < intArr.Length; i++)
    writer.Write(intArr[i]);

// Reading
BinaryReader reader = new BinaryReader(new FileStream(...));
int[] intArr = new int[reader.ReadInt32()];

for (int i = 0; i < intArr.Length; i++)
    intArr[i] = reader.ReadInt32();

// Writing, method 2
BinaryWriter writer = new BinaryWriter(new FileStream(...));
int[] intArr = new int[1000];
byte[] byteArr = new byte[intArr.Length * sizeof(int)];
Buffer.BlockCopy(intArr, 0, byteArr, 0, intArr.Length * sizeof(int));

writer.Write(intArr.Length);
writer.Write(byteArr);

// Reading, method 2
BinaryReader reader = new BinaryReader(new FileStream(...));
int[] intArr = new int[reader.ReadInt32()];
byte[] byteArr = reader.ReadBytes(intArr.Length * sizeof(int));
Buffer.BlockCopy(byteArr, 0, intArr, 0, byteArr.Length);

I decided to put this to the test: with an array of 10,000 integers, I ran each method 10,000 times.

Method 1 took an average of 888,200 ns on my system (about 0.89 ms), while method 2 took an average of only 568,600 ns (about 0.57 ms).

Both times include the work the garbage collector has to do.

Obviously method 2 is faster than method 1, though possibly less readable.

Another reason method 1 can be better than method 2 is memory: method 2 requires free RAM equal to double the amount of data you're going to write (the original int[] plus the byte[] converted from it), which matters when dealing with limited RAM or extremely large files (talking about 512MB+). In that case, you can always build a hybrid solution and write away, for example, 128MB at a time.

Note that method 1 also needs some temporary space, but because it is split into one operation per element of the int[], that memory can be released much earlier.

Something like this will write an int[] 128MB at a time:

const int WRITECOUNT = 32 * 1024 * 1024; // 32M ints = 128MB per chunk

int[] intArr = new int[140 * 1024 * 1024]; // 140M ints = 560MB
for (int i = 0; i < intArr.Length; i++)
    intArr[i] = i;

byte[] byteArr = new byte[WRITECOUNT * sizeof(int)]; // 128MB buffer

int dataDone = 0; // number of ints written so far

using (Stream fileStream = new FileStream("data.dat", FileMode.Create))
using (BinaryWriter writer = new BinaryWriter(fileStream))
{
    while (dataDone < intArr.Length)
    {
        int dataToWrite = intArr.Length - dataDone;
        if (dataToWrite > WRITECOUNT) dataToWrite = WRITECOUNT;
        // Buffer.BlockCopy offsets and counts are in bytes, not elements.
        Buffer.BlockCopy(intArr, dataDone * sizeof(int), byteArr, 0, dataToWrite * sizeof(int));
        // Only write the bytes that were actually copied (the last chunk may be smaller).
        writer.Write(byteArr, 0, dataToWrite * sizeof(int));
        dataDone += dataToWrite;
    }
}

Note that this covers only writing; reading works a bit differently too :P. I hope this gives you some more insight into dealing with very large data files :).
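For completeness, a chunked reader would be the mirror image of the writer above. This is just a sketch under the same assumptions (raw ints with no length prefix, so the element count is derived from the file size; the ChunkedReader name is illustrative):

```csharp
using System;
using System.IO;

class ChunkedReader
{
    const int READCOUNT = 32 * 1024 * 1024; // 32M ints = 128MB per chunk

    public static int[] ReadAllInts(string path)
    {
        using (FileStream fileStream = new FileStream(path, FileMode.Open))
        using (BinaryReader reader = new BinaryReader(fileStream))
        {
            // No length prefix in this format, so derive the count from the file size.
            int totalInts = (int)(fileStream.Length / sizeof(int));
            int[] intArr = new int[totalInts];
            byte[] byteArr = new byte[READCOUNT * sizeof(int)];

            int dataDone = 0; // number of ints read so far
            while (dataDone < totalInts)
            {
                int dataToRead = Math.Min(totalInts - dataDone, READCOUNT);
                int bytesRead = reader.Read(byteArr, 0, dataToRead * sizeof(int));
                // Buffer.BlockCopy offsets and counts are in bytes, not elements.
                Buffer.BlockCopy(byteArr, 0, intArr, dataDone * sizeof(int), bytesRead);
                dataDone += bytesRead / sizeof(int);
            }
            return intArr;
        }
    }
}
```

The same trade-off applies: only one 128MB buffer is alive at a time, rather than a byte[] the size of the whole file.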

Aidiakapi answered Nov 02 '22 22:11


If you've just got a bunch of integers, then using JSON will indeed be pretty inefficient in terms of parsing. You can use BinaryReader and BinaryWriter to write binary files efficiently... but it's not clear to me why you need to read the file every time you create an object anyway. Why can't each new object keep a reference to the original array, which has been read once? Or if they need to mutate the data, you could keep one "canonical source" and just copy that array in memory each time you create an object.
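That "canonical source" idea might be sketched like this (the ArrayCache class, Lazy&lt;T&gt; wrapper, and file name are illustrative, not from the question; the file is assumed to use the length-prefixed binary format from the other answer):

```csharp
using System;
using System.IO;

class ArrayCache
{
    // The canonical copy, deserialized exactly once on first access.
    // Lazy<T> makes the one-time load thread-safe by default.
    static readonly Lazy<int[]> canonical =
        new Lazy<int[]>(() => LoadFromFile("data.dat"));

    static int[] LoadFromFile(string path)
    {
        using (BinaryReader reader = new BinaryReader(File.OpenRead(path)))
        {
            int[] arr = new int[reader.ReadInt32()]; // length-prefixed format
            for (int i = 0; i < arr.Length; i++)
                arr[i] = reader.ReadInt32();
            return arr;
        }
    }

    // Each object that needs to mutate the data gets its own in-memory copy;
    // cloning an array is far cheaper than re-reading and re-parsing a file.
    public static int[] GetCopy()
    {
        return (int[])canonical.Value.Clone();
    }
}
```

Objects that only read the data could share `canonical.Value` directly and skip the copy entirely.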

Jon Skeet answered Nov 02 '22 23:11