Using std::vector as a low-level buffer

The usage here is the same as in "Using read() directly into a C++ std::vector", but with reallocation taken into account.

The size of the input file is unknown, so the buffer is reallocated by doubling its size whenever the file size exceeds the buffer size. Here's my code:

#include <vector>
#include <fstream>
#include <iostream>

int main()
{
    const size_t initSize = 1;
    std::vector<char> buf(initSize); // sizes buf to initSize, so &buf[0] below is valid
    std::ifstream ifile("D:\\Pictures\\input.jpg", std::ios_base::in|std::ios_base::binary);
    if (ifile)
    {
        size_t bufLen = 0;
        for (buf.reserve(1024); !ifile.eof(); buf.reserve(buf.capacity() << 1))
        {
            std::cout << buf.capacity() << std::endl;
            ifile.read(&buf[0] + bufLen, buf.capacity() - bufLen);
            bufLen += ifile.gcount();
        }
        std::ofstream ofile("rebuild.jpg", std::ios_base::out|std::ios_base::binary);
        if (ofile)
        {
            ofile.write(&buf[0], bufLen);
        }
    }
}

The program prints the vector capacity as expected and writes an output file of the same size as the input, BUT only the bytes before offset initSize match the input; everything after that is zero...

Using &buf[bufLen] in read() is definitely undefined behavior, but &buf[0] + bufLen gets the right position to write to because contiguous allocation is guaranteed, isn't it? (Provided initSize != 0. Note that std::vector<char> buf(initSize); sizes buf to initSize. And yes, if initSize == 0, a fatal runtime error occurs in my environment.) Am I missing something? Is this also UB? Does the standard say anything about this usage of std::vector?

Yes, I know we can calculate the file size first and allocate a buffer of exactly that size, but in my project the input files can be expected to nearly ALWAYS be smaller than a certain SIZE, so I can set initSize to SIZE and expect no overhead (such as file size calculation), and use reallocation just for "exception handling". And yes, I know I can replace reserve() with resize() and capacity() with size() and get things working with little overhead (zeroing the buffer on every resize), but I still want to get rid of any redundant operation. Just a kind of paranoia...

Update 1:

In fact, we can logically deduce from the standard that &buf[0] + bufLen gets the right position. Consider:

std::vector<char> buf(128);
buf.reserve(512);
char* bufPtr0 = &buf[0], *bufPtrOutofRange = &buf[0] + 200;
buf.resize(256); std::cout << "standard guarantees no reallocation" << std::endl;
char* bufPtr1 = &buf[0], *bufInRange = &buf[200]; 
if (bufPtr0 == bufPtr1)
    std::cout << "so bufPtr0 == bufPtr1" << std::endl;
std::cout << "and 200 < buf.size(), standard guarantees bufInRange == bufPtr1 + 200" << std::endl;
if (bufInRange == bufPtrOutofRange)
    std::cout << "finally we have: bufInRange == bufPtrOutofRange" << std::endl;

output:

standard guarantees no reallocation
so bufPtr0 == bufPtr1
and 200 < buf.size(), standard guarantees bufInRange == bufPtr1 + 200
finally we have: bufInRange == bufPtrOutofRange

And here 200 can be replaced with any i such that buf.size() <= i < buf.capacity(), and a similar deduction holds.

Update 2:

Yes, I did miss something... But the problem is not contiguity (see update 1), and not even a failure to write to memory (see my answer). Today I got some time to look into the problem: the program got the right address and wrote the right data into the reserved memory, but in the next reserve(), buf is reallocated with ONLY the elements in the range [0, buf.size()) copied to the new memory. So this is the answer to the whole riddle...

Final note: If you don't need reallocation after your buffer is filled with some data, you can definitely use reserve()/capacity() instead of resize()/size(), but if you do need it, use the latter. Also, under all implementations available here (VC++, g++, ICC), the example works as expected:

const size_t initSize = 1;
std::vector<char> buf(initSize);
buf.reserve(1024*100); // assume the reserved space is enough for file reading
std::ifstream ifile("D:\\Pictures\\input.jpg", std::ios_base::in|std::ios_base::binary);
if (ifile)
{
    ifile.read(&buf[0], buf.capacity());  // ok. the whole file is read into buf
    std::ofstream ofile("rebuild.jpg", std::ios_base::out|std::ios_base::binary);
    if (ofile)
    {
        ofile.write(&buf[0], ifile.gcount()); // rebuild.jpg is identical to input.jpg
    }
}
buf.reserve(1024*200); // horror! probably always lose all data in buf after offset initSize

And here's another example, quoted from 'TC++PL, 4e', p. 1041. Note that the first line in the function uses reserve() rather than resize():

void fill(istream& in, string& s, int max)
// use s as target for low-level input (simplified)
{
    s.reserve(max); // make sure there is enough allocated space
    in.read(&s[0],max);
    const int n = in.gcount(); // number of characters read
    s.resize(n);
    s.shrink_to_fit();  // discard excess capacity
}

Update 3 (after 8 years): Many things have happened during these years. I have not used C++ as my working language for nearly 6 years, and now I am a PhD student! Also, though many think there is UB here, the reasons they gave are quite different (and some were already shown not to be UB), indicating that this is a complex case. So, before casting votes and writing answers, it is highly recommended to read and take part in the comments.

Another thing is that, with the PhD training, I can now dive into the C++ standard with relative ease, which I did not dare to do years ago. I believe I showed in my own answer that, based on the standard, the above two code blocks should work. (The string example requires C++11.) Since my answer is still contentious (but not falsified, I believe), I do not accept it, but rather am open to critical reviews and other answers.

wpzdm asked Sep 28 '13 07:09

1 Answer

reserve doesn't actually add the space to the vector, it only makes sure that you won't need a reallocation when you resize it. Instead of using reserve you should use resize, then do a final resize once you know how many bytes you actually read in.

All that reserve is guaranteed to do is prevent the invalidation of iterators and pointers as you increase the size of the vector up to capacity(). It is not guaranteed to maintain the contents of those reserved bytes unless they're part of the size().
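One way to observe this without touching memory outside size() is to count how many elements a reallocating reserve() carries over. The Counted type and transfersOnReserve function below are hypothetical names introduced only for this sketch, and the exact count is what common implementations do rather than something the sketch proves from the standard.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical element type that counts how often it is copied or moved.
struct Counted
{
    static int transfers;
    Counted() = default;
    Counted(const Counted&) { ++transfers; }
    Counted(Counted&&) noexcept { ++transfers; }
    Counted& operator=(const Counted&) = default;
    Counted& operator=(Counted&&) = default;
};
int Counted::transfers = 0;

// Returns how many elements a growing reserve() carried to the new storage.
int transfersOnReserve(std::size_t size, std::size_t firstReserve)
{
    std::vector<Counted> v(size);     // size() == size
    v.reserve(firstReserve);          // capacity() >= firstReserve
    Counted::transfers = 0;           // ignore whatever happened so far
    v.reserve(v.capacity() * 2);      // larger than capacity(), so storage is reallocated
    return Counted::transfers;        // tracks size(), not the old capacity
}
```

For example, transfersOnReserve(1, 1024) is expected to return 1 on common implementations: only the single element inside size() is moved, and the other 1023 reserved slots are simply abandoned.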

For example, it's common for code built with a Debug flag to include extra features that make it easier to find bugs. Maybe newly allocated memory will be filled with a well-defined pattern. And maybe the class will periodically scan that memory to see if it has changed, and throw an exception if it has, under the assumption that only a bug could have caused that change. Such an implementation would still be standard conforming.

The example of std::string is even better, because there's a case that's almost guaranteed to fail. string::c_str() will return a pointer to the string with a null terminator character at the end. Now a conforming implementation could allocate a second buffer with room for the terminating null and return that pointer after copying the string, but that would be very wasteful. Much more likely is that the string class will just make sure its reserved buffer has room for the extra null character and write a null there as necessary. But the standard doesn't dictate when that null will be written, it could be in the call to c_str or it could be at any point where the string might be modified. So you have no way of knowing when one of your bytes is going to be overwritten.

If you really want a buffer of uninitialized bytes, std::vector<char> is probably the wrong tool anyway. You should look at a smart pointer such as std::unique_ptr<char[]> instead.

Mark Ransom answered Sep 19 '22 12:09