
MSVC ifstream performance issue with unsigned datatype

I ran some tests with std::ifstream on MSVC when reading binary files, and I see a large performance difference between the char and unsigned char element types.

Results when reading a 512 MB binary file:

Duration read as signed: 322 ms
Duration read as unsigned: 10552 ms

Below is the code I used for the test:

#include <vector>
#include <iostream>
#include <fstream>
#include <chrono>
#include <limits>
#include <filesystem>

int main()
{
    const std::filesystem::path filePath{ "test.data" }; // 512 MB binary file
    const size_t fileSize{ std::filesystem::file_size(filePath) };

    {
        std::basic_ifstream<char> fileStream{ filePath, std::fstream::binary };
        std::vector<char> data;
        data.resize(fileSize);

        const auto start{ std::chrono::system_clock::now() };
        fileStream.read(data.data(), fileSize);
        const auto end{ std::chrono::system_clock::now() };

        std::cout << "Duration read as signed: " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << " ms" << std::endl;
    }

    {
        std::basic_ifstream<unsigned char> fileStream{ filePath, std::fstream::binary };
        std::vector<unsigned char> data;
        data.resize(fileSize);

        const auto start{ std::chrono::system_clock::now() };
        fileStream.read(data.data(), fileSize);
        const auto end{ std::chrono::system_clock::now() };

        std::cout << "Duration read as unsigned: " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << " ms" << std::endl;
    }

    return 0;
}

I don't understand why basic_ifstream<unsigned char> is about 30 times slower than basic_ifstream<char> when reading a binary file.

James Magnus asked Aug 09 '21 22:08


2 Answers

I've tracked this down to line 549 of the fstream header (C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\include\fstream):

virtual streamsize __CLR_OR_THIS_CALL xsgetn(_Elem* _Ptr, streamsize _Count) override {
    // get _Count characters from stream
    if constexpr (sizeof(_Elem) == 1) {
        if (_Count <= 0) {
            return 0;
        }

        if (_Pcvt) { // if we need a nontrivial codecvt transform, do the default expensive thing
            return _Mysb::xsgetn(_Ptr, _Count);
        }

For unsigned char, execution goes into that "default expensive thing" branch.
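A possible explanation for why char avoids that branch (an assumption based on my reading of the MSVC header, not something the excerpt above shows): _Pcvt appears to stay null when the stream's codecvt facet reports always_noconv(). The standard guarantees that result for codecvt<char, char, mbstate_t>, which performs no conversion at all; there is no such guarantee for an unsigned char stream. A minimal sketch that checks the char case, where char_codecvt_is_noconv is a hypothetical helper name:

```cpp
#include <cwchar>   // std::mbstate_t
#include <locale>

// Hypothetical helper (not part of the MSVC sources): returns true when the
// codecvt facet used by narrow char streams reports that it never converts.
// The link between this result and _Pcvt being null is an assumption.
bool char_codecvt_is_noconv(const std::locale& loc = std::locale::classic())
{
    using Facet = std::codecvt<char, char, std::mbstate_t>;
    return std::use_facet<Facet>(loc).always_noconv();
}
```

Calling char_codecvt_is_noconv() should return true on any conforming implementation, which would be consistent with basic_ifstream<char> skipping the character-by-character path.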

Looking a bit further, the base-class xsgetn (the _Mysb::xsgetn called above) looks like this:

virtual streamsize __CLR_OR_THIS_CALL xsgetn(_Elem* _Ptr, streamsize _Count) { // get _Count characters from stream
    const streamsize _Start_count = _Count;

    while (0 < _Count) {
        streamsize _Size = _Gnavail();
        if (0 < _Size) { // copy from read buffer
            if (_Count < _Size) {
                _Size = _Count;
            }

            _Traits::copy(_Ptr, gptr(), static_cast<size_t>(_Size));
            _Ptr += _Size;
            _Count -= _Size;
            gbump(static_cast<int>(_Size));
        } else {
            const int_type _Meta = uflow();
            if (_Traits::eq_int_type(_Traits::eof(), _Meta)) {
                break; // end of file, quit
            }

            // get a single character
            *_Ptr++ = _Traits::to_char_type(_Meta);
            --_Count;
        }
    }

    return _Start_count - _Count;
}

Note the one-by-one processing in the else branch: each uflow() call yields a single character. And to_char_type itself doesn't do much:

_NODISCARD static constexpr _Elem to_char_type(const int_type& _Meta) noexcept {
    return static_cast<_Elem>(_Meta);
}
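One way to sidestep the slow path, sketched here as a suggestion rather than as anything the library prescribes, is to keep the stream as a plain std::ifstream and let it fill an unsigned char buffer through a char* view of the same memory. The helper name read_file_bytes and its error handling are hypothetical choices for illustration:

```cpp
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): reads a whole file into
// unsigned bytes while using the narrow char stream for the actual I/O.
std::vector<unsigned char> read_file_bytes(const std::filesystem::path& path)
{
    std::ifstream file{ path, std::ios::binary };
    if (!file) {
        throw std::runtime_error{ "cannot open " + path.string() };
    }

    std::vector<unsigned char> data(
        static_cast<std::size_t>(std::filesystem::file_size(path)));

    // Object storage may be accessed through char*, so the stream can write
    // directly into the unsigned char buffer.
    file.read(reinterpret_cast<char*>(data.data()),
              static_cast<std::streamsize>(data.size()));

    if (file.gcount() != static_cast<std::streamsize>(data.size())) {
        throw std::runtime_error{ "short read from " + path.string() };
    }
    return data;
}
```

With this shape the element type of the stream stays char, so the read should take the same route as the fast case in the question, while callers still receive unsigned char data.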
Vlad Feinstein answered Oct 31 '22 14:10


The performance issue disappears when you give the stream an explicit read buffer with pubsetbuf:

    {
        std::basic_ifstream<unsigned char> fileStream{ filePath, std::fstream::binary };
        std::vector<unsigned char> data;
        data.resize(fileSize);

        unsigned char buf[8192U];
        fileStream.rdbuf()->pubsetbuf(buf, 8192U);

        const auto start{ std::chrono::system_clock::now() };
        fileStream.read(data.data(), fileSize);
        const auto end{ std::chrono::system_clock::now() };

        std::cout << "Duration read unsigned with buffer: " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << " ms" << std::endl;
    }

Results:

Duration read signed: 331 ms
Duration read unsigned: 10505 ms
Duration read unsigned with buffer: 223 ms
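A side note on the measurement itself: the timings here use std::chrono::system_clock, which may be adjusted while the program runs. For intervals, std::chrono::steady_clock is the monotonic alternative. A small sketch of a timing wrapper built on it, where elapsed_ms is a hypothetical helper name and not part of the code above:

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: runs `work` once and returns the elapsed time in
// milliseconds, measured with the monotonic steady_clock.
template <typename Work>
long long elapsed_ms(Work&& work)
{
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}
```

It could wrap each read call, for example elapsed_ms([&] { fileStream.read(data.data(), fileSize); }), so that all three variants are timed the same way.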
James Magnus answered Oct 31 '22 14:10