We have built an in-memory database, which eats about 100-150 GB of RAM in a single Vec, which is populated like this:
let mut result = Vec::with_capacity(a_very_large_number);
while let Ok(n) = reader.read(&mut buffer) {
    if n == 0 { break; } // read returns Ok(0) at end of input
    result.push(...);
}
perf top shows that the time is mostly spent in the kernel's change_protection function:
Samples: 48K of event 'cpu-clock', Event count (approx.): 694742858
62.45% [kernel] [k] change_protection
18.18% iron [.] database::Database::init::h63748
7.45% [kernel] [k] vm_normal_page
4.88% libc-2.17.so [.] __memcpy_ssse3_back
0.92% [kernel] [k] copy_user_enhanced_fast_string
0.52% iron [.] memcpy@plt
The CPU usage of this function grows as more and more data is loaded into RAM:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12383 iron 20 0 137g 91g 1372 D 76.1 37.9 27:37.00 iron
The code is running on an r3.8xlarge AWS EC2 instance, and transparent huge pages are already disabled:
[~]$ cat /sys/kernel/mm/transparent_hugepage/defrag
always madvise [never]
[~]$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping : 4
microcode : 0x428
cpu MHz : 2500.070
cache size : 25600 KB
physical id : 0
siblings : 16
core id : 0
cpu cores : 8
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms
bogomips : 5000.14
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
kernel:
3.14.35-28.38.amzn1.x86_64
The real question is: why is there so much overhead in that function?
This seems to be an OS issue rather than an issue with this specific Rust function.
Most OSes (including Linux) use demand paging. By default, Linux will not allocate physical pages for newly allocated memory. Instead, it will map a single shared zero page, with read-only permissions, over all of the allocated memory (i.e., every virtual memory page will point to this single physical page).
If you attempt to write to the memory, a page fault will happen, a new physical page will be allocated, and its permissions will be set accordingly.
I'm guessing that you are seeing this effect in your program. If you try to do the same thing a second time, it should be much faster. There are also ways to control this policy via sysctl: https://www.kernel.org/doc/Documentation/vm/overcommit-accounting.
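You can observe the first-touch effect directly with a minimal sketch like the one below. The 4 GiB buffer size and the one-write-per-4-KiB-page stride are illustrative assumptions, not values from the program above. On the first pass, every touched page takes a fault; on the second pass, the pages are already resident, so it should be dramatically faster.

use std::time::Instant;

// Write one byte per 4 KiB page so each page is faulted in exactly once.
fn touch_every_page(buf: &mut [u8]) {
    for i in (0..buf.len()).step_by(4096) {
        buf[i] = 1;
    }
}

fn main() {
    // Illustrative size; not the 100-150 GB from the question.
    let size = 4usize << 30; // 4 GiB
    // A large zeroed allocation is typically lazily mapped:
    // no physical pages are backing it yet.
    let mut buf = vec![0u8; size];

    let t0 = Instant::now();
    touch_every_page(&mut buf);
    println!("first pass (faults pages in): {:?}", t0.elapsed());

    let t1 = Instant::now();
    touch_every_page(&mut buf);
    println!("second pass (pages resident): {:?}", t1.elapsed());
}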
I'm not sure why you disabled THP, but in this case huge pages might help you, since the protection change will happen once per huge page (2 MiB) instead of once per normal page (4 KiB).
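If you re-enable THP in madvise mode, a program can opt a specific region into huge pages with madvise(MADV_HUGEPAGE). Here is a minimal sketch of that, assuming the libc crate as a dependency; the 1 GiB size is illustrative, and note that this call fails with EINVAL while /sys/kernel/mm/transparent_hugepage/enabled is set to [never] as shown above.

use std::alloc::{alloc_zeroed, dealloc, Layout};

fn main() {
    // Illustrative 1 GiB region, aligned to 2 MiB so the kernel has a
    // chance to back the whole range with 2 MiB huge pages.
    let size = 1usize << 30;
    let layout = Layout::from_size_align(size, 2 << 20).unwrap();
    unsafe {
        let ptr = alloc_zeroed(layout);
        assert!(!ptr.is_null());
        // Ask the kernel to back this range with transparent huge pages,
        // so change_protection runs once per 2 MiB instead of once per 4 KiB.
        // Requires THP in "always" or "madvise" mode.
        let ret = libc::madvise(ptr as *mut libc::c_void, size, libc::MADV_HUGEPAGE);
        assert_eq!(ret, 0, "madvise(MADV_HUGEPAGE) failed");
        // ... populate the region here, as in the loop above ...
        dealloc(ptr, layout);
    }
}

An alternative, if you want guaranteed huge pages rather than a hint, is pre-reserved hugetlbfs pages mapped via mmap with MAP_HUGETLB, at the cost of reserving them up front.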