Is it possible to have an array of a custom-size data type (3/5/6/7 bytes) in C or Cython?
I have run into a major memory inefficiency while implementing a complex algorithm. The algorithm calls for storing an enormous amount of data, all of it arranged in a contiguous block of memory (like an array). The data is simply a very long list of (usually) very large numbers. The type of the numbers is constant for a given list; it behaves almost like a regular C array, where every element has the same type.
Sometimes it is not efficient to store each number in one of the standard data sizes. The normal data types are char, short, int, long, etc. However, if I use an int array to store numbers whose range fits in 3 bytes, I lose 1 byte of space per number. That is extremely inefficient, and when you are storing millions of numbers, the effect breaks the memory budget. There is unfortunately no other way to implement the solution to the algorithm, and I believe a rough implementation of a custom data size is the only way to do this.
I have tried to use char arrays to accomplish this, but converting back and forth between individual 0-255 byte values and the larger data type is inefficient in most cases. Often there is a mathematical method of packing chars into a larger number, or of taking that larger number and dividing out its individual chars. Here is my extremely inefficient attempt at this, written in Cython:
def to_bytes(long long number, int length):
    cdef:
        list chars = []
        long long m
        long long d
    # peel off one octet at a time, least significant first
    for _ in range(length):
        m = number % 256
        d = number // 256
        chars.append(m)
        number = d
    cdef bytearray binary = bytearray(chars)
    binary = binary[::-1]  # reverse to most-significant-first order
    return binary

def from_bytes(string):
    # Python 2 idiom: the 'hex' codec turns the byte string into its hex representation
    cdef long long d = int(str(string).encode('hex'), 16)
    return d
Keep in mind that I am not really looking for improvements to this algorithm, but for a fundamental way of declaring an array of a certain data size, so that I don't have to do this transformation at all.
I think the important question is whether you need access to all the data at the same time.
If you only need to access one chunk of data at a time
If you only need to access one array at a time, then one Pythonic possibility is to use NumPy arrays with data type uint8 and width as needed. When you need to operate on the data, you expand the compressed data (here 3-octet numbers into uint32):
import numpy as np

# in this example `compressed` is an Nx3 array of octets (`uint8`),
# holding the three low octets of each number in little-endian order;
# note the explicit dtype: the float64 default could not be viewed as uint32
expanded = np.empty((compressed.shape[0], 4), dtype=np.uint8)
expanded[:, :3] = compressed
expanded[:, 3] = 0
expanded = expanded.view('uint32').reshape(-1)
Then the operations are performed on expanded, which is a 1-d vector of N uint32 values.
After we are done, the data can be saved back:
# recompress
compressed[:] = expanded.view('uint8').reshape(-1,4)[:,:3]
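For completeness, here is a sketch of how such an Nx3 compressed buffer could be produced in the first place (the names values and N are illustrative, and little-endian byte order is assumed, as in the rest of this answer):

import numpy as np

# hypothetical example data: N numbers that each fit into 3 octets
N = 1000000
values = np.random.randint(0, 2**24, size=N, dtype=np.uint32)

# keep only the three low octets of each little-endian uint32
compressed = values.view(np.uint8).reshape(-1, 4)[:, :3].copy()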
The time taken for each direction is (on my machine, with Python) approximately 8 ns per element for the example above. Using Cython may not give much of a performance advantage here, because almost all the time is spent copying the data between buffers somewhere in the dark depths of NumPy.
This is a high one-time cost, but if you are planning to access each element at least once, it is probably less expensive to pay the one-time cost than a similar cost for each operation.
Of course, the same approach can be taken in C:
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <sys/resource.h>

#define NUMITEMS 10000000

int main(void)
{
    uint32_t *expanded;
    uint8_t *compressed, *exp_as_octets;
    struct rusage ru0, ru1;
    uint8_t *ep, *cp, *end;
    double time_delta;

    // create some compressed data (contents left uninitialized)
    compressed = (uint8_t *)malloc(NUMITEMS * 3);

    getrusage(RUSAGE_SELF, &ru0);
    // allocate the expanded buffer and copy the data into it
    exp_as_octets = (uint8_t *)malloc(NUMITEMS * 4);
    end = exp_as_octets + NUMITEMS * 4;
    ep = exp_as_octets;
    cp = compressed;
    while (ep < end)
    {
        // copy three octets, zero the fourth
        *ep++ = *cp++;
        *ep++ = *cp++;
        *ep++ = *cp++;
        *ep++ = 0;
    }
    // reinterpret the 4-octet buffer as uint32_t values
    expanded = (uint32_t *)exp_as_octets;
    getrusage(RUSAGE_SELF, &ru1);

    printf("Uncompress\n");
    time_delta = ru1.ru_utime.tv_sec + ru1.ru_utime.tv_usec * 1e-6
               - ru0.ru_utime.tv_sec - ru0.ru_utime.tv_usec * 1e-6;
    printf("User: %.6lf seconds, %.2lf nanoseconds per element\n",
           time_delta, 1e9 * time_delta / NUMITEMS);
    time_delta = ru1.ru_stime.tv_sec + ru1.ru_stime.tv_usec * 1e-6
               - ru0.ru_stime.tv_sec - ru0.ru_stime.tv_usec * 1e-6;
    printf("System: %.6lf seconds, %.2lf nanoseconds per element\n",
           time_delta, 1e9 * time_delta / NUMITEMS);

    getrusage(RUSAGE_SELF, &ru0);
    // compress back: keep three octets, skip the zero octet
    ep = exp_as_octets;
    cp = compressed;
    while (ep < end)
    {
        *cp++ = *ep++;
        *cp++ = *ep++;
        *cp++ = *ep++;
        ep++;
    }
    getrusage(RUSAGE_SELF, &ru1);

    printf("Compress\n");
    time_delta = ru1.ru_utime.tv_sec + ru1.ru_utime.tv_usec * 1e-6
               - ru0.ru_utime.tv_sec - ru0.ru_utime.tv_usec * 1e-6;
    printf("User: %.6lf seconds, %.2lf nanoseconds per element\n",
           time_delta, 1e9 * time_delta / NUMITEMS);
    time_delta = ru1.ru_stime.tv_sec + ru1.ru_stime.tv_usec * 1e-6
               - ru0.ru_stime.tv_sec - ru0.ru_stime.tv_usec * 1e-6;
    printf("System: %.6lf seconds, %.2lf nanoseconds per element\n",
           time_delta, 1e9 * time_delta / NUMITEMS);
    return 0;
}
This reports:
Uncompress
User: 0.022650 seconds, 2.27 nanoseconds per element
System: 0.016171 seconds, 1.62 nanoseconds per element
Compress
User: 0.011698 seconds, 1.17 nanoseconds per element
System: 0.000018 seconds, 0.00 nanoseconds per element
The code was compiled with gcc -Ofast and is probably relatively close to the optimal speed. The system time is spent in malloc. To my eye this looks pretty fast, as we are doing memory reads at 2-3 GB/s. (Which also means that while making the code multi-threaded would be easy, there may not be much speed benefit.)
If you want the best performance, you'll need to code the compression/decompression routines for each data width separately, as sketched below. (I do not promise the C code above is the absolute fastest on any machine; I did not inspect the generated machine code.)
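For instance, the 5-octet width would get its own routine, expanding into uint64_t slots; here is a minimal sketch in the same style as the loop above (expand5 is a hypothetical name; little-endian order is assumed):

#include <stdint.h>
#include <stddef.h>

// expand nitems packed 5-octet values into 8-octet (uint64_t) slots;
// `out` must have room for nitems * 8 octets
static void expand5(uint8_t *out, const uint8_t *in, size_t nitems)
{
    uint8_t *ep = out;
    uint8_t *end = out + nitems * 8;
    const uint8_t *cp = in;
    while (ep < end)
    {
        // copy five octets, zero the three high octets
        *ep++ = *cp++;
        *ep++ = *cp++;
        *ep++ = *cp++;
        *ep++ = *cp++;
        *ep++ = *cp++;
        *ep++ = 0;
        *ep++ = 0;
        *ep++ = 0;
    }
}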
If you need random access to separate values
If you, instead, need to access only one value here and another there, Python will not offer any reasonably fast methods, as the per-element lookup overhead is huge.
In this case I suggest you create C routines to fetch the values and put them back; see technosaurus's answer. There are a lot of tricks, but the alignment problems cannot be avoided.
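As a rough illustration of such routines, a byte-by-byte fetch/put pair for the 3-octet case might look like the following (a sketch with hypothetical names get3 and put3; because it works octet by octet, it is independent of host alignment and byte order, at the cost of a few extra instructions):

#include <stdint.h>
#include <stddef.h>

// read element n from a packed array of 3-octet little-endian values
static inline uint32_t get3(const uint8_t *buf, size_t n)
{
    const uint8_t *p = buf + 3 * n;
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16);
}

// write element n into a packed array of 3-octet little-endian values
static inline void put3(uint8_t *buf, size_t n, uint32_t value)
{
    uint8_t *p = buf + 3 * n;
    p[0] = (uint8_t)value;
    p[1] = (uint8_t)(value >> 8);
    p[2] = (uint8_t)(value >> 16);
}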
One useful trick when reading the odd-sized array might be this (here reading 3 octets from the octet array compressed into a uint32_t value):
value = *(uint32_t *)&compressed[3 * n] & 0x00ffffff;  // little-endian
Then the hardware takes care of the possible misalignment, and the one octet of garbage at the top is masked away (note that reading the last element this way touches one octet past the end of the array). Unfortunately this trick cannot be used when writing the values. And, again, this may or may not be faster than the other alternatives.