Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does unordered_set use significantly more RAM than data it contains?

I have a relatively large file that I needed to ensure contained only unique lines. The file is only 500MB. I understand that there is plenty of overhead, but I was seeing nearly 5GB of RAM usage. I could have done this using an external merge sort and maintain a small amount of RAM, but this seemed faster to code.

I am using VC++14.

#include <string>
#include <vector>
#include <fstream>
#include <iostream>
#include <algorithm>
#include <unordered_set>

using std::vector;
using std::string;
using std::unordered_set;

class uniqify {
    unordered_set<string> s;
public:
    auto exists(const string &filename) const -> bool {
        std::ifstream fin(filename);
        bool good = fin.good();
        return fin.close(), good;
    }

    void read(const string &filename) {
        std::ifstream input(filename);
        string line;
        while (std::getline(input, line))
            if (line.size())
                s.insert(line);
    }

    void write(const string &filename) const {
        std::ofstream fout(filename);
        for (auto line : s)
            fout << line << "\n";
        fout.close();
    }
};

int main(int argc, char **argv) {
    uniqify u;
    string file("file.txt");
    if(u.exists(file))
        u.read(file);
    u.write("output_file.txt");
    return 0;
}

What causes the RAM to skyrocket over 10x as much?

like image 428
Goodies Avatar asked Dec 06 '22 18:12

Goodies


1 Answers

An unordered_set is a node-based container. Last time I checked, MSVC uses a doubly linked list to store the elements, and a vector of iterators into that linked list to delineate the buckets. Default max_load_factor() of unordered_set is 1, so there are at least many buckets as nodes. And it stores roughly one list iterator - which is one pointer - per bucket. So for each node you have two pointers' worth of overhead from the doubly linked list, plus at least one pointer from the bucket, for three pointers total.

Then std::string adds its own overhead on top. MSVC's std::string is, I believe, two pointers + 16 bytes SSO buffer. Strings longer than 15 characters will use dynamic allocation, which costs more.

So each string in the set costs at least 5 pointers + 16 bytes SSO buffer, at 8 bytes per pointer that's 56 bytes per string minimum. With 55M strings that's about 3GB right there. And we haven't counted longer-than-15-character strings, nor the per-node memory allocation overhead, which can easily bring it up to 5GB.

like image 138
T.C. Avatar answered May 16 '23 08:05

T.C.