 

Allocating copy on write memory within a process

I have a memory segment which was obtained via mmap with MAP_ANONYMOUS.

How can I allocate a second memory segment of the same size which references the first one, and make both copy-on-write in Linux (working on Linux 2.6.36 at the moment)?

I want to have exactly the same effect as fork, just without creating a new process. I want the new mapping to stay in the same process.

The whole process has to be repeatable on both the original and the copied pages (as if parent and child kept forking).

The reason I don't want to allocate a straight copy of the whole segment is that the segments are multiple gigabytes in size, and I don't want to spend memory that could be shared copy-on-write.

What I have tried:

mmap the segment shared and anonymous. On duplication, mprotect it read-only and create a second, also read-only, mapping with remap_file_pages.

Then use libsigsegv to intercept write attempts, manually make a copy of the page and then mprotect both to read-write.

This does the trick, but it is very dirty; I am essentially implementing my own VM.
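
Roughly, the fault-handling half of that hack looks like the sketch below. It uses a plain SIGSEGV handler instead of libsigsegv, and the region bookkeeping names (cow_base, cow_size) are made up for illustration; it is not the original code.

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char  *cow_base;   /* start of the write-protected region (illustrative) */
static size_t cow_size;   /* its size                                           */

static void cow_fault(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    char *addr = (char *)si->si_addr;
    if (addr < cow_base || addr >= cow_base + cow_size)
        abort();                                   /* not ours: a real crash */

    size_t psz  = (size_t)sysconf(_SC_PAGE_SIZE);
    char  *page = (char *)((uintptr_t)addr & ~(uintptr_t)(psz - 1));

    /* Copy the page's current contents into a fresh private page ...        */
    void *tmp = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memcpy(tmp, page, psz);

    /* ... and move that writable private page over the faulting address, so
       the interrupted write can be retried and succeed.                     */
    mremap(tmp, psz, psz, MREMAP_MAYMOVE | MREMAP_FIXED, page);
}

int install_cow_handler(char *region, size_t size)
{
    cow_base = region;
    cow_size = size;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = cow_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}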

Sadly, mmapping /proc/self/mem is not supported on current Linux, otherwise a MAP_PRIVATE mapping there could do the trick.

Copy-on-write mechanics are part of the Linux VM; there has to be a way to make use of them without creating a new process.

As a note: I have found the appropriate mechanics in the Mach VM.

The following code compiles on my OS X 10.7.5 and has the expected behaviour: Darwin 11.4.2 Darwin Kernel Version 11.4.2: Thu Aug 23 16:25:48 PDT 2012; root:xnu-1699.32.7~1/RELEASE_X86_64 x86_64 i386

gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)

#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#ifdef __MACH__
#include <mach/mach.h>
#endif

int main() {

    mach_port_t this_task = mach_task_self();

    struct {
        size_t rss;
        size_t vms;
        void * a1;
        void * a2;
        char p1;
        char p2;
    } results[3];

    size_t length = sysconf(_SC_PAGE_SIZE);
    vm_address_t first_address;
    kern_return_t result = vm_allocate(this_task, &first_address, length, VM_FLAGS_ANYWHERE);

    if ( result != ERR_SUCCESS ) {
        fprintf(stderr, "Error allocating initial 0x%zu memory.\n", length);
        return -1;
    }

    char * first_address_p = (char *)first_address;
    char * mirror_address_p;
    *first_address_p = 'a';

    struct task_basic_info t_info;
    mach_msg_type_number_t t_info_count = TASK_BASIC_INFO_COUNT;

    task_info(this_task, TASK_BASIC_INFO, (task_info_t)&t_info, &t_info_count);

    task_info(this_task, TASK_BASIC_INFO, (task_info_t)&t_info, &t_info_count);
    results[0].rss = t_info.resident_size;
    results[0].vms = t_info.virtual_size;
    results[0].a1 = first_address_p;
    results[0].p1 = *first_address_p;

    vm_address_t mirrorAddress;
    vm_prot_t cur_prot, max_prot;
    result = vm_remap(this_task,
                      &mirrorAddress,    // mirror target
                      length,            // size of mirror
                      0,                 // auto alignment
                      1,                 // remap anywhere
                      this_task,         // same task
                      first_address,     // mirror source
                      1,                 // copy
                      &cur_prot,         // unused protection struct
                      &max_prot,         // unused protection struct
                      VM_INHERIT_COPY);

    if ( result != ERR_SUCCESS ) {
        perror("vm_remap");
        fprintf(stderr, "Error remapping pages.\n");
        return -1;
    }

    mirror_address_p = (char *)mirrorAddress;

    task_info(this_task, TASK_BASIC_INFO, (task_info_t)&t_info, &t_info_count);
    results[1].rss = t_info.resident_size;
    results[1].vms = t_info.virtual_size;
    results[1].a1 = first_address_p;
    results[1].p1 = *first_address_p;
    results[1].a2 = mirror_address_p;
    results[1].p2 = *mirror_address_p;

    *mirror_address_p = 'b';

    task_info(this_task, TASK_BASIC_INFO, (task_info_t)&t_info, &t_info_count);
    results[2].rss = t_info.resident_size;
    results[2].vms = t_info.virtual_size;
    results[2].a1 = first_address_p;
    results[2].p1 = *first_address_p;
    results[2].a2 = mirror_address_p;
    results[2].p2 = *mirror_address_p;

    printf("Allocated one page of memory and wrote to it.\n");
    printf("*%p = '%c'\nRSS: %zu\tVMS: %zu\n",
           results[0].a1, results[0].p1, results[0].rss, results[0].vms);
    printf("Cloned that page copy-on-write.\n");
    printf("*%p = '%c'\n*%p = '%c'\nRSS: %zu\tVMS: %zu\n",
           results[1].a1, results[1].p1, results[1].a2, results[1].p2,
           results[1].rss, results[1].vms);
    printf("Wrote to the new cloned page.\n");
    printf("*%p = '%c'\n*%p = '%c'\nRSS: %zu\tVMS: %zu\n",
           results[2].a1, results[2].p1, results[2].a2, results[2].p2,
           results[2].rss, results[2].vms);

    return 0;
}

I want the same effect in Linux.

asked Jun 06 '13 by Sergey L.


2 Answers

I tried to achieve the same thing (in fact, my case is slightly simpler, as I only need to take snapshots of a live region; I do not need to take copies of the copies). I did not find a good solution for this.

Direct kernel support (or the lack thereof): By modifying the kernel or adding a module it should be possible to achieve this. However, there is no simple way to set up a new COW region from an existing one. The code used by fork (copy_page_range) copies a vm_area_struct from one process/virtual address space to another (new) one, but assumes that the address of the new mapping is the same as the address of the old one. If one wants to implement a "remap" feature, the function must be modified/duplicated in order to copy a vm_area_struct with address translation.

BTRFS: I thought of using COW on btrfs for this. I wrote a simple program that maps two reflinked copies of the same file and compares their page information. However, looking at the page information with /proc/self/pagemap shows that the two instances of the file do not share the same page-cache pages (at least unless my test is wrong). So you will not gain much by doing this: the physical pages holding the same data will not be shared between different instances.

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <inttypes.h>
#include <stdio.h>

void* map_file(const char* file) {
  struct stat file_stat;
  int fd = open(file, O_RDWR);
  assert(fd>=0);
  int temp = fstat(fd, &file_stat);
  assert(temp==0);
  void* res = mmap(NULL, file_stat.st_size, PROT_READ, MAP_SHARED, fd, 0);
  assert(res!=MAP_FAILED);
  close(fd);
  return res;
}

static int pagemap_fd = -1;

/* Return the /proc/self/pagemap entry for the page containing p. */
uint64_t pagemap_info(void* p) {
  if(pagemap_fd<0) {
    pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
    if(pagemap_fd<0) {
      perror("open pagemap");
      exit(1);
    }
  }
  size_t page = ((uintptr_t) p) / getpagesize();
  off_t pos = lseek(pagemap_fd, page*sizeof(uint64_t), SEEK_SET);
  if(pos==(off_t) -1) {
    perror("lseek");
    exit(1);
  }
  uint64_t value;
  ssize_t n = read(pagemap_fd, (char*)&value, sizeof(uint64_t));
  if(n<0) {
    perror("read");
    exit(1);
  }
  if(n!=sizeof(uint64_t)) {
    exit(1);
  }
  return value;
}

int main(int argc, char** argv) {
  char* a = (char*) map_file(argv[1]);
  char* b = (char*) map_file(argv[2]);

  /* Touch each mapping so its page is faulted in before querying pagemap. */
  int x = a[0];
  uint64_t info1 = pagemap_info(a);

  int y = b[0];
  uint64_t info2 = pagemap_info(b);

  (void)x; (void)y;

  fprintf(stderr, "%" PRIx64 " %" PRIx64 "\n", info1, info2);

  assert(info1==info2);

  return 0;
}
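
For reference, the two reflinked test files can be created with cp --reflink=always, or programmatically with the clone ioctl. A hedged sketch follows; FICLONE (from <linux/fs.h>) needs a much newer kernel than the 2.6.36 mentioned in the question, where the btrfs-specific BTRFS_IOC_CLONE would be used instead, and the file names are made up.

#include <sys/ioctl.h>
#include <fcntl.h>
#include <linux/fs.h>       /* FICLONE */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int src = open("data.bin", O_RDONLY);
    int dst = open("data-clone.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    /* Share the extents of src with dst (reflink); no data is copied. */
    if (ioctl(dst, FICLONE, src) != 0) { perror("ioctl(FICLONE)"); return 1; }

    close(src);
    close(dst);
    return 0;
}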

mprotect+mmap anonymous pages: It does not work in your case, but a solution for mine is to use a MAP_SHARED file for the main memory region. On a snapshot, the file is mapped somewhere else and both instances are mprotect-ed read-only. On a write, an anonymous page is mapped into the snapshot, the data is copied into this new page, and the original page is unprotected. However, this solution does not work in your case because you cannot repeat the process on the snapshot (it is no longer a plain MAP_SHARED area but a MAP_SHARED area with some MAP_ANONYMOUS pages). Moreover, it does not scale with the number of copies: if I have many COW copies, I have to repeat the same process for each copy, and the page will not be shared among the copies. And I can't map the anonymous page into the original area, as it would then not be possible to map the anonymous pages in the copies. So this solution does not work anyway.
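
For illustration, the snapshot step of that scheme would look roughly like the sketch below; backing_fd and size are assumed to already exist, and the write-fault handling would follow the same SIGSEGV pattern as in the question.

#include <sys/types.h>
#include <sys/mman.h>

void *take_snapshot(void *live, int backing_fd, size_t size)
{
    /* Map the same backing file a second time; this is the snapshot view. */
    void *snap = mmap(NULL, size, PROT_READ, MAP_SHARED, backing_fd, 0);
    if (snap == MAP_FAILED)
        return MAP_FAILED;

    /* Write-protect the live region so the next write to it faults and can
       be redirected to a private anonymous page in the snapshot. */
    mprotect(live, size, PROT_READ);
    return snap;
}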

mprotect+remap_file_pages: This looks like the only way to do this without touching the Linux kernel. The downside is that, in general, you will probably have to make a remap_file_pages syscall for each page when taking a copy: making that many syscalls might not be very efficient. When a shared page has to be un-shared (on a write), you need at least to remap_file_pages a new/free file page in place of the written-to page and un-mprotect that page. You also need to reference-count each page.
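
A minimal sketch of what one such per-page call would look like; the mapping base and offsets are made up, the prot argument must be 0, and the offset is given in pages.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/* Point virtual page "page_index" of an existing MAP_SHARED mapping at file
   page "backing_page" without creating a new vm_area_struct. */
int alias_page(char *mapping_base, size_t page_index, size_t backing_page)
{
    size_t psz = (size_t)sysconf(_SC_PAGE_SIZE);
    return remap_file_pages(mapping_base + page_index * psz, psz,
                            0, backing_page, 0);
}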

I do not think that the mprotect() based approaches would scale very well (if you handle a lot of memory like this). On Linux, mprotect() does not work at memory-page granularity but at vm_area_struct granularity (the entries you find in /proc/<pid>/maps). Doing mprotect() at page granularity will cause the kernel to constantly split and merge vm_area_structs:

  • you will end up with a very large mm_struct (a huge number of vm_area_structs);

  • looking up a vm_area_struct (which is needed for a lot of virtual-memory-related operations) is O(log #vm_area_struct), but it might still have a negative performance impact;

  • all those structures themselves consume memory.

For this kind of reason, the remap_file_pages() syscall was created [http://lwn.net/Articles/24468/] in order to do non-linear memory mapping of a file: doing the same thing with mmap would require a lot of vm_area_structs. I do not even think that it was designed for page-granularity mapping: remap_file_pages() is not very optimised for this use case, as it would need a syscall per page.

I think the only viable solution is to let the kernel do it. It is possible to do it in userspace with remap_file_pages, but it will probably be quite inefficient, as a snapshot will in general need a number of syscalls proportional to the number of pages. A variant of remap_file_pages might do the trick.

This approach, however, duplicates the kernel's page-handling logic; I tend to think we should let the kernel do this. All in all, an implementation in the kernel seems to be the better solution. For someone who knows this part of the kernel, it should be quite easy to do.

KSM (Kernel Samepage Merging): There is one thing the kernel can do for you: it can try to deduplicate the pages. You will still have to copy the data, but the kernel should be able to merge the pages afterwards. You need to mmap a new anonymous area for your copy, copy the data manually with memcpy, and madvise(start, length, MADV_MERGEABLE) both areas. You also need to enable KSM (as root):

echo 1 > /sys/kernel/mm/ksm/run
echo 10000 > /sys/kernel/mm/ksm/pages_to_scan

It works, though not so well with my workload, but that is probably because the pages end up not being shared much. The downside is that you still have to do the copy (you cannot get an efficient COW), and the kernel will later un-merge any page that gets written to again. Doing the copies generates page and cache faults, and the KSM daemon thread consumes a lot of CPU (I have one CPU running at 100% for the whole simulation) and probably a lot of cache. So you will not save time when doing the copy, but you might save some memory. If your main motivation is to use less memory in the long run and you do not care that much about avoiding the copies, this solution might work for you.
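
A rough sketch of the copy-and-merge step described above; the names are illustrative, and it assumes KSM has been enabled as shown.

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

/* Take an explicit copy of "orig" and mark both areas as mergeable so the
   KSM daemon can deduplicate identical pages later. */
void *ksm_copy(void *orig, size_t size)
{
    void *copy = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (copy == MAP_FAILED)
        return MAP_FAILED;

    memcpy(copy, orig, size);            /* the copy itself is still paid for */

    madvise(orig, size, MADV_MERGEABLE); /* candidates for samepage merging   */
    madvise(copy, size, MADV_MERGEABLE);
    return copy;
}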

answered Sep 18 '22 by ysdx


Hmm... you could create a file in /dev/shm with MAP_SHARED, write to it, then reopen it twice with MAP_PRIVATE.
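
A minimal sketch of that suggestion; the file name is made up and error checking is omitted.

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

int main(void)
{
    size_t size = 4096;
    int fd = open("/dev/shm/cow-demo", O_RDWR | O_CREAT | O_TRUNC, 0600);
    ftruncate(fd, size);

    /* The "master" view: writes here go to the shared file itself. */
    char *master = mmap(NULL, size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    strcpy(master, "hello");

    /* Two private views: each gets its own page only when written to. */
    char *copy1 = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE, fd, 0);
    char *copy2 = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE, fd, 0);

    copy1[0] = 'H';                /* triggers COW for copy1's page only */

    /* copy2 still sees the shared data; copy1 now has a private page.  */
    return (copy2[0] == 'h' && copy1[0] == 'H') ? 0 : 1;
}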

answered Sep 22 '22 by thejh