How to read 4GB file on 32bit system

Question

I have various files; assume one of them is larger than 4 GB. I want to read that file line by line and process each line. One restriction is that the software has to run on 32-bit MS Windows, or on 64-bit with a small amount of RAM (4 GB minimum). You can also assume that processing the lines isn't the bottleneck.

In my current solution I read the file with ifstream and copy each line into a string. Here is a snippet showing how it looks.

std::ifstream file(filename_xml.c_str());
uintmax_t m_numLines = 0;
std::string str;
while (std::getline(file, str))
{
    m_numLines++;
}

That works, but too slowly. Here is the timing for my 3.6 GB of data:

real    1m4.155s
user    0m0.000s
sys     0m0.030s

I'm looking for a method that is much faster than that. For example, I found How to parse space-separated floats in C++ quickly? and I loved the presented solution with boost::mapped_file, but I ran into another problem: what if my file is too big? In my case a 1 GB file was enough to bring down the entire process. I have to be careful about how much data I keep in memory, because the people who will use this tool probably don't have more than 4 GB of RAM installed.

So I found mapped_file from Boost, but how do I use it in my case? Is it possible to read the file partially and still get these lines?

Maybe you have another, much better solution. I just have to process each line.

Thanks,
Bart

sehe · Accepted Answer

Nice to see you found my benchmark at How to parse space-separated floats in C++ quickly?

It seems you're really looking for the fastest way to count lines (or do any linear single-pass analysis). I've done a similar analysis and benchmark of exactly that here:

Fast textfile reading in c++

Interestingly, you'll see that the most performant code there does not need to rely on memory mapping at all.

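#include <stdint.h>   // uintmax_t
#include <stdio.h>    // perror
#include <stdlib.h>   // exit
#include <string.h>   // memchr
#include <fcntl.h>    // open, posix_fadvise
#include <unistd.h>   // read, close

// Report the failing call and abort (helper assumed by the snippet).
static void handle_error(const char *msg)
{
    perror(msg);
    exit(EXIT_FAILURE);
}

static uintmax_t wc(char const *fname)
{
    static const auto BUFFER_SIZE = 16*1024;
    int fd = open(fname, O_RDONLY);
    if(fd == -1)
        handle_error("open");

    /* Advise the kernel of our access pattern.  */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[BUFFER_SIZE + 1];
    uintmax_t lines = 0;

    // read() returns 0 at end of file and -1 on error.
    while(size_t bytes_read = read(fd, buf, BUFFER_SIZE))
    {
        if(bytes_read == (size_t)-1)
            handle_error("read failed");

        // Count every '\n' in the chunk that was just read.
        for(char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p)
            ++lines;
    }

    close(fd);
    return lines;
}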

Mats Petersson · Answer

On a 64-bit system with little memory, mapping a large file should be fine, because it's all about address space. It may well be slower than the "fastest" option in that case; it really depends on what else is in memory and how much memory is available for mapping the file into. On a 32-bit system it won't work, since the pointers into the file mapping can't go beyond about 3.5 GB at the very most, and typically around 2 GB is the maximum, again depending on what memory addresses are available to the OS to map the file into.

However, the benefit of memory mapping a file is pretty small: the vast majority of the time is spent actually reading the data. The saving from memory mapping comes from not having to copy the data once it's loaded into RAM. (With other file-reading mechanisms, the read function copies the data into the buffer supplied, whereas memory mapping a file puts it straight into the correct location.)
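
If memory mapping is still wanted on a 32-bit system, one way around the address-space limit is to map the file one window at a time instead of all at once. The sketch below only shows the offset arithmetic for that; Window, window_count and window_at are hypothetical names, and the actual mapping call is left out. As far as I know, Boost's mapped_file_source accepts a length and an offset and expects the offset to be a multiple of its alignment() value, so the window size is assumed to be a multiple of that.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical description of one mapping window; not a Boost type.
struct Window
{
    std::uint64_t offset;  // where the window starts in the file
    std::uint64_t length;  // how many bytes to map (shorter for the last window)
};

// Number of windows needed to cover file_size bytes (rounds up).
inline std::uint64_t window_count(std::uint64_t file_size, std::uint64_t window_size)
{
    return (file_size + window_size - 1) / window_size;
}

// The index-th window, for index < window_count(file_size, window_size).
// Offsets are multiples of window_size, so choosing window_size as a multiple
// of the mapping alignment keeps every offset aligned.
inline Window window_at(std::uint64_t index, std::uint64_t file_size, std::uint64_t window_size)
{
    const std::uint64_t offset = index * window_size;
    return Window{offset, std::min(window_size, file_size - offset)};
}
```

The sizes are kept in 64-bit integers on purpose, because a 32-bit size_t cannot hold the size of a file larger than 4 GB. A line can straddle two windows, so the unfinished tail of one window still has to be carried over to the next, just as with plain buffered reads.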

Tags: c++, large-files, boost, 32-bit, data-processing

bioky

