I have a problem with high CPU consumption by the Linux kernel while bootstrapping my Java applications on a server. This problem occurs only in production; on dev servers everything is lightning fast.
upd9: There were two questions about this issue:
How to fix it? Nominal Animal suggested syncing and dropping everything, and it really helps: sudo sh -c 'sync ; echo 3 > /proc/sys/vm/drop_caches'. It works. upd12: But in fact sync alone is enough.
Why is this happening? It is still open for me. I understand that flushing dirty pages to disk consumes kernel CPU and IO time, and that is normal. But what is strange is why even a single-threaded application written in C can load ALL cores to 100% in kernel space.
Based on upd10 and upd11 below, I have an idea that echo 3 > /proc/sys/vm/drop_caches is not required to fix my problem with slow memory allocation. It should be enough to run sync before starting the memory-consuming application. I will probably try this tomorrow in production and post the results here.
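For illustration only (this is a sketch, not one of the test programs in this question): the same idea could even be built into the program itself by calling sync(2) once before the first large allocation, under the assumption that flushing dirty pages up front avoids the slow first-touch path:

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        /* Flush dirty page-cache pages to disk before allocating,
         * so the kernel can reclaim clean pages without writeback. */
        sync();

        const size_t size = 256UL * 1024UL * 1024UL; /* 256 MB, arbitrary */
        char *buffer = malloc(size);
        if (buffer == NULL)
            return 1;

        memset(buffer, 0, size); /* first touch: pages are faulted in here */

        free(buffer);
        return 0;
    }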
upd10: Lots of clean FS cache pages case:
I ran cat 10GB.file > /dev/null, then sync to be sure there were no dirty pages (cat /proc/meminfo | grep ^Dirty displayed 184 kB, and cat /proc/meminfo | grep ^Cached showed 4 GB cached). Running int main(char**) then gave normal performance (about 50 ms to initialize 32 MB of allocated data).
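(For convenience, here is a small helper sketch, not part of the original tests, that prints just the Dirty: and Cached: lines from /proc/meminfo instead of grepping by hand:)

    #include <stdio.h>
    #include <string.h>

    /* Print the Dirty: and Cached: lines from /proc/meminfo. */
    int main(void) {
        FILE *f = fopen("/proc/meminfo", "r");
        if (f == NULL) {
            perror("fopen");
            return 1;
        }
        char line[256];
        while (fgets(line, sizeof line, f) != NULL) {
            if (strncmp(line, "Dirty:", 6) == 0 || strncmp(line, "Cached:", 7) == 0)
                fputs(line, stdout);
        }
        fclose(f);
        return 0;
    }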
upd11: Lots of dirty pages case:
I ran my HowMongoDdWorks example with the read part commented out, and after some time /proc/meminfo said that 2.8 GB was Dirty and 3.6 GB was Cached.
I stopped HowMongoDdWorks and ran my int main(char**).
Here is part of the result:
init 15, time 0.00s
x 0 [try 1/part 0] time 1.11s
x 1 [try 2/part 0] time 0.04s
x 0 [try 1/part 1] time 1.04s
x 1 [try 2/part 1] time 0.05s
x 0 [try 1/part 2] time 0.42s
x 1 [try 2/part 2] time 0.04s
Summary of this test: lots of dirty pages significantly slow down the first access to allocated memory (to be fair, this only starts to happen when total application memory becomes comparable to the whole OS memory, i.e. if you have 8 of 16 GB free, allocating 1 GB is no problem; the slowdown starts from about 3 GB).
Now I have managed to reproduce this situation in my dev environment, so here are the new details.
Dev machine configuration:
It is 99.9% certain that the problem is caused by a large amount of dirty pages in the FS cache. Here is the application which creates lots of dirty pages:
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

/**
 * @author dmitry.mamonov
 *         Created: 10/2/12 2:53 PM
 */
public class HowMongoDdWorks {
    public static void main(String[] args) throws IOException {
        final long length = 10L * 1024L * 1024L * 1024L;
        final int pageSize = 4 * 1024;
        final int lengthPages = (int) (length / pageSize);
        final byte[] buffer = new byte[pageSize];
        final Random random = new Random();
        System.out.println("Init file");
        final RandomAccessFile raf = new RandomAccessFile("random.file", "rw");
        raf.setLength(length);
        int written = 0;
        int readed = 0;
        System.out.println("Test started");
        while (true) {
            { //write.
                random.nextBytes(buffer);
                final long randomPageLocation = (long) random.nextInt(lengthPages) * (long) pageSize;
                raf.seek(randomPageLocation);
                raf.write(buffer);
                written++;
            }
            { //read.
                random.nextBytes(buffer);
                final long randomPageLocation = (long) random.nextInt(lengthPages) * (long) pageSize;
                raf.seek(randomPageLocation);
                raf.read(buffer);
                readed++;
            }
            if (written % 1024 == 0 || readed % 1024 == 0) {
                System.out.printf("W %10d R %10d pages\n", written, readed);
            }
        }
    }
}
And here is the test application which causes high CPU load in kernel space (up to 100% across all cores). It is the same program as below, but I copy it here once again.
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char** argv){
    int last = clock(); //remember the time
    for(int i = 0; i < 16; i++){ //repeat test several times
        int size = 256 * 1024 * 1024;
        int size4 = size / 4;
        int* buffer = malloc(size); //allocate 256MB of memory
        for(int k = 0; k < 2; k++){ //initialize allocated memory twice
            for(int j = 0; j < size4; j++){
                //memory initialization (if I skip this step my test ends in 0.000s)
                buffer[j] = k;
            }
            //print statistics
            printf("x [%d] %.2f\n", k+1, (clock()-last)/(double)CLOCKS_PER_SEC);
            last = clock();
        }
    }
    return 0;
}
While the previous HowMongoDdWorks program is running, int main(char** argv) shows results like this:
x [1] 0.23
x [2] 0.19
x [1] 0.24
x [2] 0.19
x [1] 1.30 -- first initialization takes significantly longer
x [2] 0.19 -- than the second one (about 6x slower)
x [1] 10.94 -- and sometimes it is 50x slower!!!
x [2] 0.19
x [1] 1.10
x [2] 0.21
x [1] 1.52
x [2] 0.19
x [1] 0.94
x [2] 0.21
x [1] 2.36
x [2] 0.20
x [1] 3.20
x [2] 0.20 -- and the results are totally unstable
...
I keep everything below this line only for historical purposes.
upd1: both dev and production systems are big enough for this test. upd7: it is not paging; at least I have not seen any storage IO activity during the problem time.
upd8 (ref): a new test case and a potential explanation are at the end.
Here is my test case (I also tested Java and Python, but "C" should be the clearest):
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char** argv){
    int last = clock(); //remember the time
    for(int i = 0; i < 16; i++){ //repeat test several times
        int size = 256 * 1024 * 1024;
        int size4 = size / 4;
        int* buffer = malloc(size); //allocate 256MB of memory
        for(int k = 0; k < 2; k++){ //initialize allocated memory twice
            for(int j = 0; j < size4; j++){
                //memory initialization (if I skip this step my test ends in 0.000s)
                buffer[j] = k;
            }
            //print statistics
            printf("x [%d] %.2f\n", k+1, (clock()-last)/(double)CLOCKS_PER_SEC);
            last = clock();
        }
    }
    return 0;
}
The output on dev machine (partial):
x [1] 0.13 -- first initialization takes a bit longer
x [2] 0.12 -- than the second one, but the difference is not significant.
x [1] 0.13
x [2] 0.12
x [1] 0.15
x [2] 0.11
x [1] 0.14
x [2] 0.12
x [1] 0.14
x [2] 0.12
x [1] 0.13
x [2] 0.12
x [1] 0.14
x [2] 0.11
x [1] 0.14
x [2] 0.12 -- and the results are quite stable
...
The output on production machine (partial):
x [1] 0.23
x [2] 0.19
x [1] 0.24
x [2] 0.19
x [1] 1.30 -- first initialization takes significantly longer
x [2] 0.19 -- than the second one (about 6x slower)
x [1] 10.94 -- and sometimes it is 50x slower!!!
x [2] 0.19
x [1] 1.10
x [2] 0.21
x [1] 1.52
x [2] 0.19
x [1] 0.94
x [2] 0.21
x [1] 2.36
x [2] 0.20
x [1] 3.20
x [2] 0.20 -- and the results are totally unstable
...
While running this test on the development machine, the CPU usage barely rises from the ground; all cores stay under 5% usage in htop.
But running this test on the production machine I see up to 100% CPU usage by ALL cores (on average the load rises to 50% on a 12-core machine), and it is all kernel time.
upd2: all machines have the same CentOS Linux 2.6 installed; I work with them using ssh.
upd3: A: It's unlikely to be swapping; I haven't seen any disk activity during my test, and plenty of RAM is also free. (Also, the description is updated.) – Dmitry 9 mins ago
upd4: htop shows high CPU utilisation by the kernel, up to 100% utilisation by all cores (on prod).
upd5: Does CPU utilization settle down after initialization completes? In my simple test, yes. For the real application, it only helps to stop everything else before starting a new program (which is nonsense).
I have two questions:
Why is this happening?
How can I fix it?
upd8: Improved test and explanation.
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char** argv){
    const int partition = 8;
    int last = clock();
    for(int i = 0; i < 16; i++){
        int size = 256 * 1024 * 1024;
        int size4 = size / 4;
        int* buffer = malloc(size);
        buffer[0] = 123; //touch only the first page after allocation
        printf("init %d, time %.2fs\n", i, (clock()-last)/(double)CLOCKS_PER_SEC);
        last = clock();
        for(int p = 0; p < partition; p++){
            for(int k = 0; k < 2; k++){
                //fill one partition of the buffer, twice
                for(int j = p*size4/partition; j < (p+1)*size4/partition; j++){
                    buffer[j] = k;
                }
                printf("x [try %d/part %d] time %.2fs\n", k+1, p, (clock()-last)/(double)CLOCKS_PER_SEC);
                last = clock();
            }
        }
    }
    return 0;
}
And the result looks like this:
init 15, time 0.00s -- malloc call takes nothing.
x [try 1/part 0] time 0.07s -- usually first try to fill buffer part with values is fast enough.
x [try 2/part 0] time 0.04s -- second try to fill buffer part with values is always fast.
x [try 1/part 1] time 0.17s
x [try 2/part 1] time 0.05s -- second try...
x [try 1/part 2] time 0.07s
x [try 2/part 2] time 0.05s -- second try...
x [try 1/part 3] time 0.07s
x [try 2/part 3] time 0.04s -- second try...
x [try 1/part 4] time 0.08s
x [try 2/part 4] time 0.04s -- second try...
x [try 1/part 5] time 0.39s -- BUT sometimes it takes significantly longer than average to fill a part of the allocated buffer with values.
x [try 2/part 5] time 0.05s -- second try...
x [try 1/part 6] time 0.35s
x [try 2/part 6] time 0.05s -- second try...
x [try 1/part 7] time 0.16s
x [try 2/part 7] time 0.04s -- second try...
Facts I learned from this test:
I tried the suggested answer and it seems to have helped. I will recheck and post the results again later.
It looks like Linux maps newly allocated pages onto dirty file-system cache pages, and it takes a lot of time to flush those pages to disk one by one. But a full sync works fast and eliminates the problem.
Run
sudo sh -c 'sync ; echo 3 > /proc/sys/vm/drop_caches ; sync'
on your dev machine. It is a safe, nondestructive way to make sure your caches are empty. (You will not lose any data by running the above command, even if you happen to save or write to disk at the exact same time. It really is safe.)
Then, make sure you don't have any Java stuff running, and re-run the above command just to be sure. You can check whether any Java is running with, for example,
ps axu | sed -ne '/ sed -ne /d; /java/p'
It should output nothing. If it does, close your Java stuff first.
Now, re-run your application test. Does the same slowdown now occur on your dev machine, too?
If you care to leave the comment either way, Dmitry, I'd be happy to explore the issue further.
Edited to add: I suspect that the slowdown does occur, and is due to the large startup latency incurred by Java itself. It is a very common issue, and basically built-in to Java, a result of its architecture. For larger applications, startup latency is often a significant fraction of a second, no matter how fast the machine, simply because Java has to load and prepare the classes (mostly serially, too, so adding cores will not help).
In other words, I believe the blame should fall on Java, not Linux; quite the opposite, since Linux manages to mitigate the latency on your development machine through kernel-level caching -- and that only because you keep running those Java components practically all the time, so the kernel knows to cache them.
Edit 2: It would be very useful to see which files your Java environment accesses when your application is started. You can do this with strace:
strace -f -o trace.log -q -tt -T -e trace=open COMMAND...
which creates a file trace.log containing the open() syscalls done by any of the processes started by COMMAND.... To save the output to trace.PID for each process COMMAND... starts, use
strace -f -o trace -ff -q -tt -T -e trace=open COMMAND...
Comparing the outputs on your dev and prod installations will tell you if they are truly equivalent. One of them may have extra or missing libraries, affecting the startup time.
In case an installation is old and the system partition is reasonably full, it is possible that those files have become fragmented, causing the kernel to spend more time waiting for I/O to complete. (Note that the amount of I/O stays the same; only the time it takes to complete increases if the files are fragmented.) You can use the command
LANG=C LC_ALL=C sed -ne 's|^[^"]* open("\(.*\)", O[^"]*$|\1|p' trace.* \
| LANG=C LC_ALL=C xargs -r -d '\n' filefrag \
| LANG=C LC_ALL=C awk '(NF > 3 && $NF == "found") { n[$(NF-2)]++ }
  END { for (i in n) printf "%d extents %d files\n", i, n[i] }' \
| sort -g
to check how fragmented the files used by your application are; it reports how many files use only one, or more than one, extent. Note that it does not include the original executable (COMMAND...), only the files it accesses.
If you just want to get the fragmentation statistics for files accessed by a single command, you can use
LANG=C LC_ALL=C strace -f -q -tt -T -e trace=open COMMAND... 2>&1 \
| LANG=C LC_ALL=C sed -ne 's|^[0-9:.]* open("\(.*\)", O[^"]*$|\1|p' \
| LANG=C LC_ALL=C xargs -r filefrag \
| LANG=C LC_ALL=C awk '(NF > 3 && $NF == "found") { n[$(NF-2)]++ }
END { for (i in n) printf "%d extents %d files\n", i, n[i] }' \
| sort -g
If the problem is not due to caching, then I think it is most likely that the two installations are not truly equivalent. If they are, then I'd check the fragmentation. After that, I'd do a full trace (omitting the -e trace=open) on both environments to see where exactly the differences are.
I do believe I now understand your problem/situation.
On your prod environment, kernel page cache is mostly dirty, i.e. most cached stuff is stuff that is going to be written to disk.
When your application allocates new pages, the kernel only sets up the page mappings; it does not actually assign physical RAM immediately. That only happens on the first access to each page.
On the first access, the kernel first locates a free page -- typically, a page that contains "clean" cached data, i.e. something read from the disk but not modified. Then, it clears it to zeros, to avoid information leaks between processes. (When using the C library allocation facilities like malloc() etc., instead of the direct mmap() family of functions, the library may use/reuse parts of the mapping. Although the kernel does clear the pages to zeros, the library may "dirty" them. Using mmap() to get anonymous pages, you do get them zeroed out.)
If the kernel does not have suitable clean pages at hand, it must first flush some of the oldest dirty pages to disk. (There are processes inside the kernel that flush pages to disk and mark them clean, but if the server load is such that pages are continuously dirtied, it is usually desirable to have mostly dirty pages instead of mostly clean pages -- the server gets more work done that way. Unfortunately, it also means an increase in first-page-access latency, which is what you have now encountered.)
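If you want to see this first-touch cost in isolation, a small sketch like the following (just an illustration, with an arbitrary 256 MB size) maps anonymous memory with mmap() and times the first pass over the pages against the second pass; the first pass pays the page-fault and page-zeroing cost, the second does not:

    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <sys/mman.h>

    int main(void) {
        const size_t size = 256UL * 1024UL * 1024UL; /* 256 MB */
        unsigned char *buffer = mmap(NULL, size, PROT_READ | PROT_WRITE,
                                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buffer == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        struct timespec t0, t1, t2;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        memset(buffer, 1, size);            /* first touch: faults in every page */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        memset(buffer, 2, size);            /* second touch: pages already resident */
        clock_gettime(CLOCK_MONOTONIC, &t2);

        printf("first pass:  %.3f s\n",
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        printf("second pass: %.3f s\n",
               (t2.tv_sec - t1.tv_sec) + (t2.tv_nsec - t1.tv_nsec) / 1e9);

        munmap(buffer, size);
        return 0;
    }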
Each page is sysconf(_SC_PAGESIZE) bytes long, and aligned. In other words, pointer p points to the start of a page if and only if ((long)p % sysconf(_SC_PAGESIZE)) == 0. Most kernels, I believe, do actually populate groups of pages in most cases instead of individual pages, thus increasing the latency of the first access (to each group of pages).
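As a quick illustration of that alignment rule (again just a sketch, not part of the benchmark), this prints the page size and checks whether a malloc'd pointer happens to land on a page boundary:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        long pagesize = sysconf(_SC_PAGESIZE);   /* typically 4096 bytes */
        printf("page size: %ld bytes\n", pagesize);

        void *p = malloc(1024 * 1024);
        if (p == NULL)
            return 1;

        /* p starts a page if and only if its address is a multiple of the page size. */
        if (((long)p % pagesize) == 0)
            printf("%p is page-aligned\n", p);
        else
            printf("%p is not page-aligned\n", p);

        free(p);
        return 0;
    }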
Finally, there may be some compiler optimization that plays havoc with your benchmarking. I recommend you write a separate source file for the benchmarking main(), and put the actual work done on each iteration in a separate file. Compile them separately, and just link them together, to make sure the compiler does not rearrange the timing functions with respect to the actual work done. Basically, in benchmark.c:
#define _POSIX_C_SOURCE 200809L
#include <time.h>
#include <stdio.h>

/* in work.c, adjust as needed */
void work_init(void);      /* Optional, allocations etc. */
void work(long iteration); /* Completely up to you, including parameters */
void work_done(void);      /* Optional, deallocations etc. */

#define PRIMING 0
#define REPEATS 100

int main(void)
{
    double wall_seconds[REPEATS];
    struct timespec wall_start, wall_stop;
    long iteration;

    work_init();

    /* Priming: do you want caches hot? */
    for (iteration = 0L; iteration < PRIMING; iteration++)
        work(iteration);

    /* Timed iterations */
    for (iteration = 0L; iteration < REPEATS; iteration++) {
        clock_gettime(CLOCK_REALTIME, &wall_start);
        work(iteration);
        clock_gettime(CLOCK_REALTIME, &wall_stop);
        wall_seconds[iteration] = (double)(wall_stop.tv_sec - wall_start.tv_sec)
                                + (double)(wall_stop.tv_nsec - wall_start.tv_nsec) / 1000000000.0;
    }

    work_done();

    /* TODO: wall_seconds[0] is the first iteration.
     *       Comparing to successive iterations (assuming REPEATS > 0)
     *       tells you about the initial latency.
     */

    /* TODO: Sort wall_seconds, for easier statistics.
     *       Most reliable value is the median, with half of the
     *       values larger and half smaller.
     *       Personally, I like to discard first and last 15.85%
     *       of the results, to get "one-sigma confidence" interval.
     */

    return 0;
}
with the actual array allocation, deallocation, and filling (per repeat loop) done in the work() functions defined in work.c.
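For completeness, a minimal work.c that fits this harness could look like the sketch below; the 256 MB buffer size and the memset fill are just illustrative choices, adjust them to match your actual workload:

    #include <stdlib.h>
    #include <string.h>

    #define WORK_SIZE (256UL * 1024UL * 1024UL) /* 256 MB per iteration, adjust as needed */

    static unsigned char *work_buffer = NULL;

    void work_init(void)
    {
        /* Nothing to prepare up front; each iteration allocates its own buffer. */
    }

    void work(long iteration)
    {
        /* Allocate, touch every byte (forcing the pages to be faulted in), and free. */
        work_buffer = malloc(WORK_SIZE);
        if (work_buffer == NULL)
            return;
        memset(work_buffer, (int)(iteration & 0xFF), WORK_SIZE);
        free(work_buffer);
        work_buffer = NULL;
    }

    void work_done(void)
    {
        free(work_buffer); /* In case the last iteration left something allocated. */
        work_buffer = NULL;
    }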