 

Allocating copy on write memory within a process

I have a memory segment which was obtained via mmap with MAP_ANONYMOUS.

How can I allocate a second memory segment of the same size which references the first one, and make both copy-on-write in Linux (working on Linux 2.6.36 at the moment)?

I want to have exactly the same effect as fork, just without creating a new process. I want the new mapping to stay in the same process.

The whole process has to be repeatable on both the original and the copied pages (as if parent and child kept forking).

The reason I don't want to allocate a straight copy of the whole segment is that the segments are multiple gigabytes in size, and I don't want to spend memory that could be shared copy-on-write.

What I have tried:

mmap the segment shared and anonymous. On duplication, mprotect it read-only and create a second, also read-only, mapping with remap_file_pages.

Then use libsigsegv to intercept write attempts, manually make a copy of the page and then mprotect both to read-write.

This does the trick, but it is very dirty; I am essentially implementing my own VM.
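
Roughly, the fault-handling half of that hack looks like the sketch below. It uses a plain SIGSEGV handler instead of libsigsegv, and the region bookkeeping names (cow_base, cow_size) are made up for illustration; it is not the original code.

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char  *cow_base;   /* start of the write-protected region (illustrative) */
static size_t cow_size;   /* its size                                           */

static void cow_fault(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    char *addr = (char *)si->si_addr;
    if (addr < cow_base || addr >= cow_base + cow_size)
        abort();                                   /* not ours: a real crash */

    size_t psz  = (size_t)sysconf(_SC_PAGE_SIZE);
    char  *page = (char *)((uintptr_t)addr & ~(uintptr_t)(psz - 1));

    /* Copy the page's current contents into a fresh private page ...        */
    void *tmp = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memcpy(tmp, page, psz);

    /* ... and move that writable private page over the faulting address, so
       the interrupted write can be retried and succeed.                     */
    mremap(tmp, psz, psz, MREMAP_MAYMOVE | MREMAP_FIXED, page);
}

int install_cow_handler(char *region, size_t size)
{
    cow_base = region;
    cow_size = size;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = cow_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}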

Sadly, mmapping /proc/self/mem is not supported on current Linux, otherwise a MAP_PRIVATE mapping there could do the trick.

Copy-on-write mechanics are part of the Linux VM; there has to be a way to make use of them without creating a new process.

As a note: I have found the appropriate mechanics in the Mach VM.

The following code compiles on my OS X 10.7.5 and has the expected behaviour: Darwin 11.4.2 Darwin Kernel Version 11.4.2: Thu Aug 23 16:25:48 PDT 2012; root:xnu-1699.32.7~1/RELEASE_X86_64 x86_64 i386

gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)

#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#ifdef __MACH__
#include <mach/mach.h>
#endif

int main() {

    mach_port_t this_task = mach_task_self();

    struct {
        size_t rss;
        size_t vms;
        void * a1;
        void * a2;
        char p1;
        char p2;
    } results[3];

    size_t length = sysconf(_SC_PAGE_SIZE);
    vm_address_t first_address;
    kern_return_t result = vm_allocate(this_task, &first_address, length, VM_FLAGS_ANYWHERE);

    if ( result != ERR_SUCCESS ) {
        fprintf(stderr, "Error allocating initial 0x%zu memory.\n", length);
        return -1;
    }

    char * first_address_p = (char *)first_address;
    char * mirror_address_p;
    *first_address_p = 'a';

    struct task_basic_info t_info;
    mach_msg_type_number_t t_info_count = TASK_BASIC_INFO_COUNT;

    task_info(this_task, TASK_BASIC_INFO, (task_info_t)&t_info, &t_info_count);

    task_info(this_task, TASK_BASIC_INFO, (task_info_t)&t_info, &t_info_count);
    results[0].rss = t_info.resident_size;
    results[0].vms = t_info.virtual_size;
    results[0].a1 = first_address_p;
    results[0].p1 = *first_address_p;

    vm_address_t mirrorAddress;
    vm_prot_t cur_prot, max_prot;
    result = vm_remap(this_task,
                      &mirrorAddress,    // mirror target
                      length,            // size of mirror
                      0,                 // auto alignment
                      1,                 // remap anywhere
                      this_task,         // same task
                      first_address,     // mirror source
                      1,                 // copy
                      &cur_prot,         // unused protection struct
                      &max_prot,         // unused protection struct
                      VM_INHERIT_COPY);

    if ( result != ERR_SUCCESS ) {
        perror("vm_remap");
        fprintf(stderr, "Error remapping pages.\n");
        return -1;
    }

    mirror_address_p = (char *)mirrorAddress;

    task_info(this_task, TASK_BASIC_INFO, (task_info_t)&t_info, &t_info_count);
    results[1].rss = t_info.resident_size;
    results[1].vms = t_info.virtual_size;
    results[1].a1 = first_address_p;
    results[1].p1 = *first_address_p;
    results[1].a2 = mirror_address_p;
    results[1].p2 = *mirror_address_p;

    *mirror_address_p = 'b';

    task_info(this_task, TASK_BASIC_INFO, (task_info_t)&t_info, &t_info_count);
    results[2].rss = t_info.resident_size;
    results[2].vms = t_info.virtual_size;
    results[2].a1 = first_address_p;
    results[2].p1 = *first_address_p;
    results[2].a2 = mirror_address_p;
    results[2].p2 = *mirror_address_p;

    printf("Allocated one page of memory and wrote to it.\n");
    printf("*%p = '%c'\nRSS: %zu\tVMS: %zu\n",
           results[0].a1, results[0].p1, results[0].rss, results[0].vms);
    printf("Cloned that page copy-on-write.\n");
    printf("*%p = '%c'\n*%p = '%c'\nRSS: %zu\tVMS: %zu\n",
           results[1].a1, results[1].p1, results[1].a2, results[1].p2,
           results[1].rss, results[1].vms);
    printf("Wrote to the new cloned page.\n");
    printf("*%p = '%c'\n*%p = '%c'\nRSS: %zu\tVMS: %zu\n",
           results[2].a1, results[2].p1, results[2].a2, results[2].p2,
           results[2].rss, results[2].vms);

    return 0;
}

I want the same effect in Linux.

asked Jun 06 '13 by Sergey L.


2 Answers

I tried to achieve the same thing (in fact, my case is slightly simpler, as I only need to take snapshots of a live region; I do not need to take copies of the copies). I did not find a good solution for this.

Direct kernel support (or the lack thereof): By modifying the kernel or adding a module it should be possible to achieve this. However, there is no simple way to set up a new COW region from an existing one. The code used by fork (copy_page_range) copies a vm_area_struct from one process/virtual address space to another (new) one, but assumes that the address of the new mapping is the same as the address of the old one. If one wants to implement a "remap" feature, the function must be modified/duplicated in order to copy a vm_area_struct with address translation.

BTRFS: I thought of using COW on btrfs for this. I wrote a simple program that maps two reflinked copies of the same file and compares their page information. However, looking at the page information with /proc/self/pagemap shows that the two instances of the file do not share the same page-cache pages (at least unless my test is wrong). So you will not gain much by doing this: the physical pages holding the same data will not be shared between different instances.

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <inttypes.h>
#include <stdio.h>

void* map_file(const char* file) {
  struct stat file_stat;
  int fd = open(file, O_RDWR);
  assert(fd>=0);
  int temp = fstat(fd, &file_stat);
  assert(temp==0);
  void* res = mmap(NULL, file_stat.st_size, PROT_READ, MAP_SHARED, fd, 0);
  assert(res!=MAP_FAILED);
  close(fd);
  return res;
}

static int pagemap_fd = -1;

/* Return the /proc/self/pagemap entry for the page containing p. */
uint64_t pagemap_info(void* p) {
  if(pagemap_fd<0) {
    pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
    if(pagemap_fd<0) {
      perror("open pagemap");
      exit(1);
    }
  }
  size_t page = ((uintptr_t) p) / getpagesize();
  off_t pos = lseek(pagemap_fd, page*sizeof(uint64_t), SEEK_SET);
  if(pos==(off_t) -1) {
    perror("lseek");
    exit(1);
  }
  uint64_t value;
  ssize_t n = read(pagemap_fd, (char*)&value, sizeof(uint64_t));
  if(n<0) {
    perror("read");
    exit(1);
  }
  if(n!=sizeof(uint64_t)) {
    exit(1);
  }
  return value;
}

int main(int argc, char** argv) {
  char* a = (char*) map_file(argv[1]);
  char* b = (char*) map_file(argv[2]);

  /* Touch each mapping so its page is faulted in before querying pagemap. */
  int x = a[0];
  uint64_t info1 = pagemap_info(a);

  int y = b[0];
  uint64_t info2 = pagemap_info(b);

  (void)x; (void)y;

  fprintf(stderr, "%" PRIx64 " %" PRIx64 "\n", info1, info2);

  assert(info1==info2);

  return 0;
}
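
For reference, the two reflinked test files can be created with cp --reflink=always, or programmatically with the clone ioctl. A hedged sketch follows; FICLONE (from <linux/fs.h>) needs a much newer kernel than the 2.6.36 mentioned in the question, where the btrfs-specific BTRFS_IOC_CLONE would be used instead, and the file names are made up.

#include <sys/ioctl.h>
#include <fcntl.h>
#include <linux/fs.h>       /* FICLONE */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int src = open("data.bin", O_RDONLY);
    int dst = open("data-clone.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    /* Share the extents of src with dst (reflink); no data is copied. */
    if (ioctl(dst, FICLONE, src) != 0) { perror("ioctl(FICLONE)"); return 1; }

    close(src);
    close(dst);
    return 0;
}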

mprotect+mmap anonymous pages: It does not work in your case, but a solution for mine is to use a MAP_SHARED file for the main memory region. On a snapshot, the file is mapped somewhere else and both instances are mprotect-ed read-only. On a write, an anonymous page is mapped into the snapshot, the data is copied into this new page, and the original page is unprotected. However, this solution does not work in your case because you cannot repeat the process on the snapshot (it is no longer a plain MAP_SHARED area but a MAP_SHARED area with some MAP_ANONYMOUS pages). Moreover, it does not scale with the number of copies: if I have many COW copies, I have to repeat the same process for each copy, and the page will not be shared among the copies. And I can't map the anonymous page into the original area, as it would then not be possible to map the anonymous pages in the copies. So this solution does not work anyway.
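
For illustration, the snapshot step of that scheme would look roughly like the sketch below; backing_fd and size are assumed to already exist, and the write-fault handling would follow the same SIGSEGV pattern as in the question.

#include <sys/types.h>
#include <sys/mman.h>

void *take_snapshot(void *live, int backing_fd, size_t size)
{
    /* Map the same backing file a second time; this is the snapshot view. */
    void *snap = mmap(NULL, size, PROT_READ, MAP_SHARED, backing_fd, 0);
    if (snap == MAP_FAILED)
        return MAP_FAILED;

    /* Write-protect the live region so the next write to it faults and can
       be redirected to a private anonymous page in the snapshot. */
    mprotect(live, size, PROT_READ);
    return snap;
}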

mprotect+remap_file_pages: This looks like the only way to do this without touching the Linux kernel. The downside is that, in general, you will probably have to make a remap_file_pages syscall for each page when taking a copy: making that many syscalls might not be very efficient. When a shared page has to be un-shared (on a write), you need at least to remap_file_pages a new/free file page in place of the written-to page and un-mprotect that page. You also need to reference-count each page.
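
A minimal sketch of what one such per-page call would look like; the mapping base and offsets are made up, the prot argument must be 0, and the offset is given in pages.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/* Point virtual page "page_index" of an existing MAP_SHARED mapping at file
   page "backing_page" without creating a new vm_area_struct. */
int alias_page(char *mapping_base, size_t page_index, size_t backing_page)
{
    size_t psz = (size_t)sysconf(_SC_PAGE_SIZE);
    return remap_file_pages(mapping_base + page_index * psz, psz,
                            0, backing_page, 0);
}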

I do not think that the mprotect() based approaches would scale very well (if you handle a lot of memory like this). On Linux, mprotect() does not work at memory-page granularity but at vm_area_struct granularity (the entries you find in /proc/<pid>/maps). Doing mprotect() at page granularity will cause the kernel to constantly split and merge vm_area_structs:

  • you will end up with a very large mm_struct (a huge number of vm_area_structs);

  • looking up a vm_area_struct (which is needed for a lot of virtual-memory-related operations) is O(log #vm_area_struct), but it might still have a negative performance impact;

  • all those structures themselves consume memory.

For this kind of reason, the remap_file_pages() syscall was created [http://lwn.net/Articles/24468/] in order to do non-linear memory mapping of a file: doing the same thing with mmap would require a lot of vm_area_structs. I do not even think that it was designed for page-granularity mapping: remap_file_pages() is not very optimised for this use case, as it would need a syscall per page.

I think the only viable solution is to let the kernel do it. It is possible to do it in userspace with remap_file_pages, but it will probably be quite inefficient, as a snapshot will in general need a number of syscalls proportional to the number of pages. A variant of remap_file_pages might do the trick.

This approach, however, duplicates the kernel's page-handling logic; I tend to think we should let the kernel do this. All in all, an implementation in the kernel seems to be the better solution. For someone who knows this part of the kernel, it should be quite easy to do.

KSM (Kernel Samepage Merging): There is one thing the kernel can do for you: it can try to deduplicate the pages. You will still have to copy the data, but the kernel should be able to merge the pages afterwards. You need to mmap a new anonymous area for your copy, copy the data manually with memcpy, and madvise(start, length, MADV_MERGEABLE) both areas. You also need to enable KSM (as root):

echo 1 > /sys/kernel/mm/ksm/run
echo 10000 > /sys/kernel/mm/ksm/pages_to_scan

It works, though not so well with my workload, but that is probably because the pages end up not being shared much. The downside is that you still have to do the copy (you cannot get an efficient COW), and the kernel will later un-merge any page that gets written to again. Doing the copies generates page and cache faults, and the KSM daemon thread consumes a lot of CPU (I have one CPU running at 100% for the whole simulation) and probably a lot of cache. So you will not save time when doing the copy, but you might save some memory. If your main motivation is to use less memory in the long run and you do not care that much about avoiding the copies, this solution might work for you.
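
A rough sketch of the copy-and-merge step described above; the names are illustrative, and it assumes KSM has been enabled as shown.

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

/* Take an explicit copy of "orig" and mark both areas as mergeable so the
   KSM daemon can deduplicate identical pages later. */
void *ksm_copy(void *orig, size_t size)
{
    void *copy = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (copy == MAP_FAILED)
        return MAP_FAILED;

    memcpy(copy, orig, size);            /* the copy itself is still paid for */

    madvise(orig, size, MADV_MERGEABLE); /* candidates for samepage merging   */
    madvise(copy, size, MADV_MERGEABLE);
    return copy;
}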

answered Sep 18 '22 by ysdx


Hmm... you could create a file in /dev/shm with MAP_SHARED, write to it, then reopen it twice with MAP_PRIVATE.
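
A minimal sketch of that suggestion; the file name is made up and error checking is omitted.

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

int main(void)
{
    size_t size = 4096;
    int fd = open("/dev/shm/cow-demo", O_RDWR | O_CREAT | O_TRUNC, 0600);
    ftruncate(fd, size);

    /* The "master" view: writes here go to the shared file itself. */
    char *master = mmap(NULL, size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    strcpy(master, "hello");

    /* Two private views: each gets its own page only when written to. */
    char *copy1 = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE, fd, 0);
    char *copy2 = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE, fd, 0);

    copy1[0] = 'H';                /* triggers COW for copy1's page only */

    /* copy2 still sees the shared data; copy1 now has a private page.  */
    return (copy2[0] == 'h' && copy1[0] == 'H') ? 0 : 1;
}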

answered Sep 22 '22 by thejh