Why can we allocate a 1 PB (10^15) array and get access to the last element, but can't free it?

As the malloc(3) man page notes (http://linux.die.net/man/3/malloc):

By default, Linux follows an optimistic memory allocation strategy. This means that when malloc() returns non-NULL there is no guarantee that the memory really is available. In case it turns out that the system is out of memory, one or more processes will be killed by the OOM killer.

And we can successfully allocate a 1-petabyte VMA (virtual memory area) using malloc(petabyte): http://ideone.com/1yskmB

#include <stdio.h>
#include <stdlib.h>

int main(void) {

    long long int petabyte = 1024LL * 1024LL * 1024LL * 1024LL * 1024LL;    // 2^50
    printf("petabyte %lld \n", petabyte);

    volatile char *ptr = (volatile char *)malloc(petabyte);
    printf("malloc() - success, ptr = %p \n", ptr);

    ptr[petabyte - 1LL] = 10;
    printf("ptr[petabyte - 1] = 10; - success \n");

    printf("ptr[petabyte - 1] = %d \n", (int)(ptr[petabyte - 1LL]));

    free((void*)ptr);   // why the error is here?
    //printf("free() - success \n");

    return 0;
}

Result:

Error   time: 0 memory: 2292 signal:6
petabyte 1125899906842624 
malloc() - success, ptr = 0x823e008 
ptr[petabyte - 1] = 10; - success 
ptr[petabyte - 1] = 10 

And we can successfully access (store/load) the last byte of the petabyte allocation, but why do we get an error on free((void*)ptr);?

Note: https://en.wikipedia.org/wiki/Petabyte

  • 1 PB (petabyte) = 1000^5 bytes
  • 1 PiB (pebibyte) = 1024^5 bytes - this is what I use

So if we really want to allocate more than RAM + swap and work around the overcommit_memory limit, we can allocate memory using VirtualAllocEx() on Windows, or mmap() on Linux, for example:

  • for 16 TiB (16 * 2^40 bytes), we can use the example from Nominal Animal's answer: https://stackoverflow.com/a/38574719/1558037
  • for 127 TiB (127 * 2^40 bytes), we can use mmap() with the flags MAP_NORESERVE | MAP_PRIVATE | MAP_ANONYMOUS and fd = -1: http://coliru.stacked-crooked.com/a/c69ce8ad7fbe4560
Alex asked Jul 25 '16

2 Answers

I believe that your problem is that malloc() does not take a long long int as its argument. It takes a size_t.

After changing your code to define petabyte as a size_t, your program no longer gets a pointer back from malloc(); the allocation fails instead.

I think that your array access setting element petabyte - 1 to 10 is writing far, far outside the block malloc() actually returned. That out-of-bounds write corrupts the allocator's bookkeeping, which glibc detects and aborts on (signal 6) when free() is called. That's the crash.

Always use the correct data types when calling functions.

Use this code to see what's going on:

long long int petabyte = 1024LL * 1024LL * 1024LL * 1024LL * 1024LL;
size_t ptest = petabyte;
printf("petabyte %lld %zu\n", petabyte, ptest);

If I compile in 64-bit mode, it fails to malloc 1 petabyte. If I compile in 32-bit mode, it mallocs 0 bytes successfully, then attempts to write outside its array and segfaults.

Zan Lynx answered Nov 20 '22


(This is not an answer, but an important note for anybody working with large datasets in Linux.)

That is not how you use very large -- on the order of terabytes and up -- datasets in Linux.

When you use malloc() or mmap() (the GNU C library uses mmap() internally for large allocations anyway) to allocate private memory, the kernel limits the size to the (theoretically) available RAM and swap, multiplied by the overcommit factor.

Simply put, we know that larger-than-RAM datasets may have to be swapped out, so the size of the current swap will affect how large allocations are allowed.

To work around that, we create a file to be used as "swap" for the data, and map it using the MAP_NORESERVE flag. This tells the kernel that we don't want to use standard swap for this mapping. (It also means that if, for any reason, the kernel cannot get a new backing page, the application will get a SIGSEGV signal and die.)

Most filesystems in Linux support sparse files. This means that you can have a terabyte-sized file that only takes a few kilobytes of actual disk space, if most of its contents are never written (and are thus zeroes). (Creating sparse files is easy; you simply skip over long runs of zeroes. Hole-punching after the fact is more difficult, as writing zeroes does use normal disk space; other methods, such as fallocate() with FALLOC_FL_PUNCH_HOLE, need to be used instead.)

Here is an example program that you can use for exploration, mapfile.c:

#define _POSIX_C_SOURCE 200809L
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <string.h>
#include <errno.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    const char    *filename;
    long           page;    /* sysconf() returns a long (-1 on error) */
    size_t         size;
    int            fd, result;
    unsigned char *data;
    char           dummy;

    if (argc != 3 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
        fprintf(stderr, "       %s MAPFILE BYTES\n", argv[0]);
        fprintf(stderr, "\n");
        return EXIT_FAILURE;
    }

    page = sysconf(_SC_PAGESIZE);
    if (page < 1) {
        fprintf(stderr, "Unknown page size.\n");
        return EXIT_FAILURE;
    }

    filename = argv[1];
    if (!filename || !*filename) {
        fprintf(stderr, "No map file name specified.\n");
        return EXIT_FAILURE;
    }

    if (sscanf(argv[2], " %zu %c", &size, &dummy) != 1 || size < 3) {
        fprintf(stderr, "%s: Invalid size in bytes.\n", argv[2]);
        return EXIT_FAILURE;
    }

    if (size % page) {
        /* Round up to next multiple of page */
        size += page - (size % page);
        fprintf(stderr, "Adjusted to %zu pages (%zu bytes)\n", size / page, size);
    }

    do {
        fd = open(filename, O_RDWR | O_CREAT | O_EXCL, 0600);
    } while (fd == -1 && errno == EINTR);
    if (fd == -1) {
        fprintf(stderr, "Cannot create %s: %s.\n", filename, strerror(errno));
        return EXIT_FAILURE;
    }

    do {
        result = ftruncate(fd, (off_t)size);
    } while (result == -1 && errno == EINTR);
    if (result == -1) {
        fprintf(stderr, "Cannot resize %s: %s.\n", filename, strerror(errno));
        unlink(filename);
        close(fd);
        return EXIT_FAILURE;
    }

    data = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, fd, 0);
    if ((void *)data == MAP_FAILED) {
        fprintf(stderr, "Mapping failed: %s.\n", strerror(errno));
        unlink(filename);
        close(fd);
        return EXIT_FAILURE;
    }

    fprintf(stderr, "Created file '%s' to back a %zu-byte mapping at %p successfully.\n", filename, size, (void *)data);

    fflush(stdout);
    fflush(stderr);

    data[0] = 1U;
    data[1] = 255U;

    data[size-2] = 254U;
    data[size-1] = 127U;

    fprintf(stderr, "Mapping accessed successfully.\n");

    munmap(data, size);
    unlink(filename);
    close(fd);

    fprintf(stderr, "All done.\n");
    return EXIT_SUCCESS;
}

Compile it using e.g.

gcc -Wall -O2 mapfile.c -o mapfile

and run it without arguments to see the usage.

The program simply sets up a mapping (adjusted to a multiple of the current page size), and accesses the first two and last two bytes of the mapping.

On my machine, running a 4.2.0-42-generic #49~14.04.1-Ubuntu SMP kernel on x86-64, on an ext4 filesystem, I cannot map a full petabyte. The maximum seems to be about 17,592,186,040,320 bytes (2^44 - 4096), i.e. 16 TiB - 4 KiB, which comes to 4,294,967,296 pages of 4096 bytes (2^32 pages of 2^12 bytes each). It looks like the limit is imposed by the ext4 filesystem, as the failure occurs in the ftruncate() call (before the mapping is even tried).

(On a tmpfs I can get up to about 140,187,732,541,440 bytes, or 127.5 TiB, but that's just a gimmick: tmpfs is backed by RAM and swap, not an actual storage device, so it's not an option for real big-data work. I seem to recall XFS handles really large files, but I'm too lazy to format a partition or even look up the specs.)

Here's how that example run looks on my machine (using a Bash shell):

$ ./mapfile datafile $[(1<<44)-4096]
Created file 'datafile' to back a 17592186040320-byte mapping at 0x6f3d3e717000 successfully.
Mapping accessed successfully.
All done.


Nominal Animal answered Nov 20 '22