TLDR for people in the future: I inherited a program and need to adapt it to handle data much larger than what it was written for. I need help figuring out a way to manage multiple copies of a 30 GB array.
I recently inherited some research code (written in C) that was originally written to run on a comparatively small data set of about 5 GB. The code requires that I am able to access four copies of the array at once (one char array and three double arrays). The original author did not need to worry about memory usage, so there are multiple places where 2-4 extra arrays are held in memory concurrently.
Also, this is biological data (genome) and the arrays are not sparse.
The issue is that I now have to adapt the code so that it works when a single array of doubles is 30 GB.
I am not sure if I need all of the values to be accessible at once, but I know that it is very often the case that the code loops through all of the values.
First, I split each of the arrays into sets of 10k characters or doubles and wrote them all to files. I then changed every array access to go through functions of my own that read the value from the file or overwrite that line in the file. This worked, but it was terribly slow (most likely due to all of the file opening and closing, plus the speed of writing to disk). The program is already slow and I don't want to make it worse.
Next, I noticed that there are periods during which the program does not need a given array, so I decided to write that array to disk and read it back when needed. The issue I am facing is that it still takes a really long time (10 minutes+?) to write the entire array to disk, even though the file is opened and closed only once (unlike the method above).
As this is for research, I do have access to a computing cluster with 150 GB of RAM. I submitted this program as a job, but even then the process got killed for using too much memory. I originally suspected a memory leak, but upon further inspection it really appears that more than 5 double arrays are being created while the program runs. As a side note, my personal machine has 40 GB of memory (a weird number, I know).
I tried to disable memory overcommit in the kernel because I noticed that the program was crashing not when allocating the arrays but when it actually started accessing them. However, I don't think this ended up doing anything, because the kernel still overcommits.
One night I got quite frustrated that it was being killed all the time and decided to run the program with a niceness of -10000 which resulted in my computer crashing as it killed other processes to make up for more memory.
I also played around with using mmap() but am not sure if this is something that I should pursue.
Although I really can't be sure if it is an XY problem, I feel pretty confident that I need to have at least three arrays concurrently (although I don't jump around too much in the arrays).
Does anyone have any expertise on how to fix this issue? Thank you for your help in advance. And finally, I am using Linux.
This sounds like a good use case for mmap.
The mmap function can be used to take an open file and map it to a region of memory.  Reads and writes to the file via the returned pointer are handled internally, although you can periodically flush to disk manually.  This will allow you to manipulate a data structure larger than the physical memory of the system.
This also has the advantage that you don't need to worry about moving data back and forth from disk manually. The kernel will take care of it for you.
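If you do want to force a flush at a known point (for example before a long phase in which an array is not touched), `msync` does that. The following is a minimal sketch, not a drop-in implementation: `map_file` and `flush_mapping` are hypothetical helper names, and the path and length are whatever your program chooses.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map `len` bytes of `path` read/write,
 * creating the file if it does not exist. Returns NULL on failure. */
static void *map_file(const char *path, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return NULL;

    /* Make sure the file is exactly `len` bytes long. */
    if (ftruncate(fd, (off_t)len) == -1) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: block until the modified pages in [p, p + len)
 * have been written back to the file. Returns 0 on success. */
static int flush_mapping(void *p, size_t len)
{
    return msync(p, len, MS_SYNC);
}
```

Calling `flush_mapping` is optional. Without it, the kernel still writes modified pages back on its own schedule and when the region is unmapped.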
So for each of these large arrays, you can create a memory mapping backed by a file on disk.
#include <stdio.h>
#include <stdlib.h>     // exit
#include <unistd.h>     // lseek, write, close
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>
#define DATA_LEN 30000000000LL
int main()
{
    int array1_fd = open("/tmp/array1", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (array1_fd < 0) {
        perror("open failed");
        exit(1);
    }
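    // make sure file is big enough: seek to the last byte and write one
    // zero byte there, so the file is exactly DATA_LEN bytes long
    if (lseek(array1_fd, DATA_LEN - 1, SEEK_SET) == -1) {
        perror("seek to len failed");
        exit(1);
    }
    if (write(array1_fd, "", 1) == -1) {
        perror("write at end failed");
        exit(1);
    }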
    if (lseek(array1_fd, 0, SEEK_SET) == -1) {
        perror("seek to 0 failed");
        exit(1);
    }
    char *array1 = mmap(NULL, DATA_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, array1_fd, 0);
    if (array1 == MAP_FAILED) {
        perror("mmap failed");
        exit(1);
    }
    // Use array1
    munmap(array1, DATA_LEN);
    close(array1_fd);
    return 0;
}
The important part of the mmap call is the MAP_SHARED flag.  This means that updates to the mapped memory region are carried through to the underlying file.  With MAP_PRIVATE, by contrast, writes go to a private copy-on-write copy and never reach the file.