How can a separate loop affect the performance of an independent earlier loop?
My first loop reads some large text files and counts the lines/rows. After a malloc, the second loop populates the allocated matrix.
If the second loop is commented out, the first loop takes about 1.5 seconds. However, when the program is compiled WITH the second loop, the first loop slows down and now takes 30-40 seconds!
In other words: the second loop somehow slows down the first loop. I have tried changing scope, compilers, compiler flags, and the loop itself, bringing everything into main(), using boost::iostream, and even placing one loop in a shared library, but the problem persists in every attempt!
The first loop is fast until the program is compiled with the second loop.
EDIT: Here is a full example of my problem:
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <cstring>          // memchr
#include <cstdlib>          // std::exit
#include "boost/chrono.hpp"
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>          // open, posix_fadvise
#include <unistd.h>         // read, close
unsigned long int countLines(char const *fname) {
    static const auto BUFFER_SIZE = 16 * 1024;

    int fd = open(fname, O_RDONLY);
    if (fd == -1) {
        std::cout << "Open Error" << std::endl;
        std::exit(EXIT_FAILURE);
    }
    // advise the kernel that the file will be read sequentially
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[BUFFER_SIZE + 1];
    unsigned long int lines = 0;

    while (size_t bytes_read = read(fd, buf, BUFFER_SIZE)) {
        if (bytes_read == (size_t)-1) {
            std::cout << "Read Failed" << std::endl;
            std::exit(EXIT_FAILURE);
        }
        if (!bytes_read)
            break;
        // count the '\n' bytes in this buffer
        int n;
        char *p;
        for (p = buf, n = bytes_read; n > 0 && (p = (char *)memchr(p, '\n', n)); n = (buf + bytes_read) - ++p)
            ++lines;
    }
    close(fd);
    return lines;
}
int main(int argc, char *argv[])
{
    // initial variables
    int offset = 55;
    unsigned long int rows = 0;
    unsigned long int cols = 0;
    std::vector<unsigned long int> dbRows = {0, 0, 0};
    std::vector<std::string> files = {"DATA/test/file1.csv",   // large files: ~3 GB each
                                      "DATA/test/file2.csv",   // each line is 55 chars long
                                      "DATA/test/file3.csv"};

    // find each file's number of rows
    for (std::size_t x = 0; x < files.size(); x++) {           // <--- FIRST LOOP **
        dbRows[x] = countLines(files[x].c_str());
    }

    // define matrix rows as the largest row count found
    // define matrix cols as 55 chars per csv file
    std::vector<unsigned long int>::iterator maxCount =
        std::max_element(dbRows.begin(), dbRows.end());
    rows = *maxCount;                   // typically rows = 72716067
    cols = dbRows.size() * offset;      // cols = 165

    // malloc the required space (11998151055 bytes)
    char *syncData = (char *)malloc(rows * cols * sizeof(char));
    if (syncData == NULL) {
        std::cout << "Malloc Failed" << std::endl;
        std::exit(EXIT_FAILURE);
    }

    // fill up the allocated memory with a test letter
    char t[] = "x";
    for (unsigned long int x = 0; x < (rows * cols); x++) {    // <--- SECOND LOOP **
        syncData[x] = t[0];
    }

    free(syncData);
    return 0;
}
A profiler points the finger at this line:
while(size_t bytes_read = read(fd, buf, BUFFER_SIZE))
The program idles on this line for about 30 seconds, with a wait count of roughly 230,000. In the disassembly, the wait count falls on:
Block 5:
lea 0x8(%rsp), %rsi
mov %r12d, %edi
mov $0x4000, %edx
callq 0x402fc0 <------ stalls on callq
Block 6:
mov %rax, %rbx
test %rbx, %rbx
jz 0x404480 <Block 18>
My guess is that the read() call is blocking on I/O, but I don't understand why.
My theory:
Allocating and touching all that memory evicts the big files from the disk cache, so the next run has to read them from disk again.
If you run the version without the second loop a couple of times to warm up the disk cache, then run the version with it, I predict the first run will be fast, but later runs will be slow unless you warm up the disk cache again first.
The memory consumption happens after the files have been read. This puts "memory pressure" on the page cache (a.k.a. disk cache), forcing it to evict file data to make room for the pages your process is writing to.
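One way to test this theory is to check how much of each CSV is actually resident in the page cache before and after a run. Below is a minimal sketch of my own (not part of the original program) that maps each file and asks the kernel via mmap() + mincore(); the file names are taken from the question. On recent Linux kernels mincore() may under-report residency for files you don't own, so treat the percentages as a rough indication only.
// Rough check of page-cache residency for a file, using mmap + mincore.
// (Investigation sketch; error handling kept minimal.)
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

static double residentFraction(const char *fname) {
    int fd = open(fname, O_RDONLY);
    if (fd == -1) return -1.0;
    struct stat st;
    if (fstat(fd, &st) == -1 || st.st_size == 0) { close(fd); return -1.0; }
    void *map = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (map == MAP_FAILED) return -1.0;
    long pageSize = sysconf(_SC_PAGESIZE);
    size_t pages = (st.st_size + pageSize - 1) / pageSize;
    std::vector<unsigned char> vec(pages);
    size_t resident = 0;
    if (mincore(map, st.st_size, vec.data()) == 0)
        for (unsigned char v : vec) resident += v & 1;   // low bit = page is resident
    munmap(map, st.st_size);
    return pages ? double(resident) / pages : 0.0;
}

int main() {
    const char *files[] = {"DATA/test/file1.csv", "DATA/test/file2.csv", "DATA/test/file3.csv"};
    for (const char *f : files)
        std::printf("%s: %.1f%% in page cache\n", f, 100.0 * residentFraction(f));
    return 0;
}
If the files show up as mostly cached after a run without the second loop, but mostly uncached after a run with it, that confirms the eviction explanation.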
Your computer probably has just barely enough free RAM to cache your working set. Closing your web browser might free up enough to make a difference! Or not, since your 11998151055 bytes is about 11.1 GiB, and you're writing every page of it (every byte, even). You could do it with memset for higher performance, although I assume what you've shown is just a dummy version.
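For illustration, here is a sketch of that memset variant, using the rows/cols values quoted in the question (assuming the fill really is just a placeholder, as the dummy 'x' suggests). It fills the same allocation in one call, but it touches the same pages, so it has the same effect on the page cache:
#include <cstdlib>
#include <cstring>

int main() {
    const unsigned long rows = 72716067UL;   // row count from the question
    const unsigned long cols = 165UL;        // 3 files * 55 chars
    char *syncData = (char *)malloc(rows * cols);
    if (syncData == NULL) return 1;
    memset(syncData, 'x', rows * cols);      // replaces the byte-at-a-time SECOND LOOP
    free(syncData);
    return 0;
}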
BTW, another tool for investigating this is time ./a.out. It can show you whether your program is spending its CPU time in user-space or in the kernel ("system" time).
If user + sys adds up to the real time, your process is CPU-bound. If not, it's I/O-bound, and your process is blocking on disk I/O (which is the expected case here, since counting newlines should be fast).
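As a purely hypothetical illustration of what an I/O-bound run might look like (these numbers are invented, not measured):
real    0m35.2s
user    0m1.8s
sys     0m2.1s
Here user + sys is only a few seconds while real is about 35 s, so the process spent most of its wall-clock time blocked waiting for the disk rather than computing.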