How can a separate loop affect the performance of an independent earlier loop?
My first loop reads some large text files and counts the lines/rows. After a malloc, the second loop populates the allocated matrix.
If the second loop is commented out, the first loop takes about 1.5 seconds. However, when the program is compiled WITH the second loop, the first loop slows down and now takes 30-40 seconds!
In other words: the second loop somehow slows down the first loop. I have tried changing scope, compilers, compiler flags, and the loop itself, bringing everything into main(), using boost::iostream, and even placing one loop in a shared library, but the problem persists in every attempt!
The first loop is fast until the program is compiled with the second loop.
EDIT: Here is a full example of my problem:
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <cstring>          // memchr
#include <cstdlib>          // std::exit
#include "boost/chrono.hpp"
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>          // open, posix_fadvise
#include <unistd.h>         // read, close
unsigned long int countLines(char const *fname) {
    static const auto BUFFER_SIZE = 16 * 1024;

    int fd = open(fname, O_RDONLY);
    if (fd == -1) {
        std::cout << "Open Error" << std::endl;
        std::exit(EXIT_FAILURE);
    }
    // advise the kernel that the file will be read sequentially
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[BUFFER_SIZE + 1];
    unsigned long int lines = 0;

    while (size_t bytes_read = read(fd, buf, BUFFER_SIZE)) {
        if (bytes_read == (size_t)-1) {
            std::cout << "Read Failed" << std::endl;
            std::exit(EXIT_FAILURE);
        }
        if (!bytes_read)
            break;
        // count the '\n' bytes in this buffer
        int n;
        char *p;
        for (p = buf, n = bytes_read; n > 0 && (p = (char *)memchr(p, '\n', n)); n = (buf + bytes_read) - ++p)
            ++lines;
    }
    close(fd);
    return lines;
}
int main(int argc, char *argv[])
{
    // initial variables
    int offset = 55;
    unsigned long int rows = 0;
    unsigned long int cols = 0;
    std::vector<unsigned long int> dbRows = {0, 0, 0};
    std::vector<std::string> files = {"DATA/test/file1.csv",   // large files: ~3 GB each
                                      "DATA/test/file2.csv",   // each line is 55 chars long
                                      "DATA/test/file3.csv"};

    // find each file's number of rows
    for (std::size_t x = 0; x < files.size(); x++) {           // <--- FIRST LOOP **
        dbRows[x] = countLines(files[x].c_str());
    }

    // define matrix rows as the largest row count found
    // define matrix cols as 55 chars per csv file
    std::vector<unsigned long int>::iterator maxCount =
        std::max_element(dbRows.begin(), dbRows.end());
    rows = *maxCount;                   // typically rows = 72716067
    cols = dbRows.size() * offset;      // cols = 165

    // malloc the required space (11998151055 bytes)
    char *syncData = (char *)malloc(rows * cols * sizeof(char));
    if (syncData == NULL) {
        std::cout << "Malloc Failed" << std::endl;
        std::exit(EXIT_FAILURE);
    }

    // fill up the allocated memory with a test letter
    char t[] = "x";
    for (unsigned long int x = 0; x < (rows * cols); x++) {    // <--- SECOND LOOP **
        syncData[x] = t[0];
    }

    free(syncData);
    return 0;
}
A profiler points the finger at this line:
while(size_t bytes_read = read(fd, buf, BUFFER_SIZE))
The program idles on this line for about 30 seconds, with a wait count of roughly 230,000. In the disassembly, the wait count falls on:
Block 5:
lea 0x8(%rsp), %rsi
mov %r12d, %edi
mov $0x4000, %edx
callq 0x402fc0 <------ stalls on callq
Block 6:
mov %rax, %rbx
test %rbx, %rbx
jz 0x404480 <Block 18>
My guess is that the read() call is blocking on I/O, but I don't understand why.
My theory:
Allocating and touching all that memory evicts the big files from the disk cache, so the next run has to read them from disk again.
If you run the version without the second loop a couple of times to warm up the disk cache, then run the version with it, I predict the first run will be fast, but later runs will be slow unless you warm up the disk cache again first.
The memory consumption happens after the files have been read. This puts "memory pressure" on the page cache (a.k.a. disk cache), forcing it to evict file data to make room for the pages your process is writing to.
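One way to test this theory is to check how much of each CSV is actually resident in the page cache before and after a run. Below is a minimal sketch of my own (not part of the original program) that maps each file and asks the kernel via mmap() + mincore(); the file names are taken from the question. On recent Linux kernels mincore() may under-report residency for files you don't own, so treat the percentages as a rough indication only.
// Rough check of page-cache residency for a file, using mmap + mincore.
// (Investigation sketch; error handling kept minimal.)
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

static double residentFraction(const char *fname) {
    int fd = open(fname, O_RDONLY);
    if (fd == -1) return -1.0;
    struct stat st;
    if (fstat(fd, &st) == -1 || st.st_size == 0) { close(fd); return -1.0; }
    void *map = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (map == MAP_FAILED) return -1.0;
    long pageSize = sysconf(_SC_PAGESIZE);
    size_t pages = (st.st_size + pageSize - 1) / pageSize;
    std::vector<unsigned char> vec(pages);
    size_t resident = 0;
    if (mincore(map, st.st_size, vec.data()) == 0)
        for (unsigned char v : vec) resident += v & 1;   // low bit = page is resident
    munmap(map, st.st_size);
    return pages ? double(resident) / pages : 0.0;
}

int main() {
    const char *files[] = {"DATA/test/file1.csv", "DATA/test/file2.csv", "DATA/test/file3.csv"};
    for (const char *f : files)
        std::printf("%s: %.1f%% in page cache\n", f, 100.0 * residentFraction(f));
    return 0;
}
If the files show up as mostly cached after a run without the second loop, but mostly uncached after a run with it, that confirms the eviction explanation.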
Your computer probably has just barely enough free RAM to cache your working set. Closing your web browser might free up enough to make a difference! Or not, since your 11998151055 bytes is about 11.1 GiB, and you're writing every page of it (every byte, even). You could do it with memset for higher performance, although I assume what you've shown is just a dummy version.
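For illustration, here is a sketch of that memset variant, using the rows/cols values quoted in the question (assuming the fill really is just a placeholder, as the dummy 'x' suggests). It fills the same allocation in one call, but it touches the same pages, so it has the same effect on the page cache:
#include <cstdlib>
#include <cstring>

int main() {
    const unsigned long rows = 72716067UL;   // row count from the question
    const unsigned long cols = 165UL;        // 3 files * 55 chars
    char *syncData = (char *)malloc(rows * cols);
    if (syncData == NULL) return 1;
    memset(syncData, 'x', rows * cols);      // replaces the byte-at-a-time SECOND LOOP
    free(syncData);
    return 0;
}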
BTW, another tool for investigating this is time ./a.out. It can show you whether your program is spending its CPU time in user-space or in the kernel ("system" time).
If user + sys adds up to the real time, your process is CPU-bound. If not, it's I/O-bound, and your process is blocking on disk I/O (which is the expected case here, since counting newlines should be fast).
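As a purely hypothetical illustration of what an I/O-bound run might look like (these numbers are invented, not measured):
real    0m35.2s
user    0m1.8s
sys     0m2.1s
Here user + sys is only a few seconds while real is about 35 s, so the process spent most of its wall-clock time blocked waiting for the disk rather than computing.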