Processing a dict file with variant length ASCII words.
constexpr int MAXLINE = 1024 * 1024 * 10; // total number of words, one word per line.
Goal: read in the whole file into memory, and be able to access each word by index.
I want to quick access each word by index.We can use two-dimensional array to achieve that; however a MAXLENGTH need to be set, not mention that MAXLENGTH is not known ahead.
constexpr int MAXLENGTH= 1024; // since I do not have the maximum length of the word
char* aray = new char[MAXLINE * MAXLENGTH];
The code above would NOT be memory friendly if most words are shorter than MAXLENGTH; and also some words can be longer than MAXLENGTH, causing errors.
For variant length object, I think vector might be best fit to this problem. So I come up with vector of vector to store them.
vector<vector<char>> array(MAXLINE);
This looks so promising, until I realize that is not the case.
I tested both approaches on a dict file with MAXLINE 4-ASCII-character words.(here all words are 4-char words)
constexpr int MAXLINE = 1024 * 1024 * 10;
if I new operator the array to store, (here MAXLENGTH is just 4)
char* aray = new char[MAXLINE * 4];
The memory consumption is roughly 40MB. However, if I try to use vector to store ( I changed the char to int32_t for just fit four chars)
vector<vector<int32_t>> array(MAXLINE);
you can also use char vector, and reserve space for 4 chars.
vector<vector<char>> array(MAXLINE);
for (auto & c : array) {
c.reserve(4);
}
The memory consumption jumps up to about 720MB (debug mode), 280MB(release mode), which is so unexpected high and can someone give me some explanations for clarification why so.
obseravation: Size of vector is implementation dependent and if you are compiling in debug mode.
As on my system
sizeof(vector<int32_t>) = 16 // debug mode
and
sizeof(vector<int32_t>) = 12 // release mode
In debug mode the momory consumption is 720MB for vector<vector<int32_t>> array(MAXLINE);, while the actual vector only takes sizeof(vector<int32_t>) * MAXLINE = 16 * 10MB = 160 MB
In relase mode, the momory consumption is 280MB , however the expected value is sizeof(vector<int32_t>) * MAXLINE = 12 * 10MB = 120 MB
Can someone explain the big difference in real memory consumption and expected consumption(calculated from sub-vector size).
Appreciate, and Happy new year!
For your case:
so, does it mean vector of vectors is not a good idea to store small objects? –
Generally no. A nested sub-vector isn't such a good solution for storing a boatload of teeny variable-sized sequences. You don't want to represent an indexed mesh that allows variable-polygons (triangles, quads, pentagons, hexagons, n-gons) using a separate std::vector instance per polygon, for example, or else you'll tend to blow up memory usage and have a really slow solution: slow because there's a heap allocation involved for every single freaking polygon, and explosive in memory because vector often preallocates some memory for the elements in addition to storing size and capacity in ways that are often larger than needed if you have a boatload of teeny sequences.
vector is an excellent data structure for storing a million things contiguously, but not so excellent for storing a million teeny vectors.
In such cases even a singly-linked indexed list can work better with the indices pointing to a bigger vector, performing so much faster and sometimes even using less memory in spite of the 32-bit link overhead, like this:

That said, for your particular case with a big ol' random-access sequence of variable-length strings, this is what I recommend:
// stores the starting index of each null-terminated string
std::vector<int> string_start;
// stores the characters for *all* the strings in one single vector.
std::vector<char> strings;
That will reduce the overhead down to closer to 32-bits per string entry (assuming int is 32-bits) and you will no longer require a separate heap allocation for every single string entry you add.
After you finish reading everything in, you can minimize memory use with a compaction to truncate the array (eliminating any excess reserve capacity):
// Compact memory use using copy-and-swap.
vector<int>(string_start).swap(string_start);
vector<char>(strings).swap(strings);
Now to retrieve the nth string, you can do this:
const char* str = strings.data() + string_start[n];
If you have need for search capabilities as well, you can actually store a boatload of strings and search for them quickly (including things like prefix-based searches) storing less memory than even the above solution would take using a compressed trie. It's a bit more of an involved solution though but might be worthwhile if your software revolves around dictionaries of strings and searching through them and you might just be able to find some third party library that already provides one for ya.
std::string
Just for completeness I thought I'd throw in a mention of std::string. Recent implementations often optimize for small strings by storing a buffer in advance which is not separately heap-allocated. However, in your case that can lead to even more explosive memory usage since that makes sizeof(string) bigger in ways that can consume far more memory than needed for really short strings. It does make std::string more useful though for temporary strings, making it so you might get something that performs perfectly fine if you fetched std::string out of that big vector of chars on demand like so:
std::string str = strings.data() + string_start[n];
... as opposed to:
const char* str = strings.data() + string_start[n];
That said, the big vector of chars would do much better performance and memory wise for storing all the strings. Just generally speaking, generalized containers of any sort tend to cease to perform so well if you want to store millions of teeny ones.
The main conceptual problem is that when the desire is for like a million variable-sized sequences, the variable-sized nature of the requirement combined with the generalized nature of the container will imply that you have a million teeny memory managers, all having to potentially allocate on the heap or, if not, allocate more data than is needed, along with keeping track of its size/capacity if it's contiguous, and so on. Inevitably a million+ managers of their own memory gets quite expensive.
So in these cases, it often pays to forego the convenience of a "complete, independent" container and instead use one giant buffer, or one giant container storing the element data as with the case of vector<char> strings, along with another big container that indexes or points to it, as in the case of vector<int> string_start. With that you can represent the analogical one million variable-length strings just using two big containers instead of a million small ones.
removing nth string
Your case doesn't sound like you need to ever remove a string entry, just load and access, but if you ever need to remove a string, that can get tricky when all the strings and indices to their starting positions are stored in two giant buffers.
Here I recommend, if you wanna do this, to not actually remove the string immediately from the buffer. Instead you can simply do this:
// Indicate that the nth string has been removed.
string_start[n] = -1;
When iterating over available strings, just skip the ones where string_start[n] is -1. Then, every now and then to compact memory use after a number of strings have been removed, do this:
void compact_buffers(vector<char>& strings, vector<int>& string_start)
{
// Create new buffers to hold the new data excluding removed strings.
vector<char> new_strings;
vector<int> new_string_start;
new_strings.reserve(strings.size());
new_string_start.reserve(string_start.size());
// Store a write position into the 'new_strings' buffer.
int write_pos = 0;
// Copy strings to new buffers, skipping over removed ones.
for (int start: string_start)
{
// If the string has not been removed:
if (start != -1)
{
// Fetch the string from the old buffer.
const char* str = strings.data() + start;
// Fetch the size of the string including the null terminator.
const size_t len = strlen(str) + 1;
// Insert the string to the new buffer.
new_strings.insert(new_strings.end(), str, str + len);
// Append the current write position to the starting positions
// of the new strings.
new_string_start.push_back(write_pos);
// Increment the write position by the string size.
write_pos += static_cast<int>(len);
}
}
// Swap compacted new buffers with old ones.
vector<char>(new_strings).swap(strings);
vector<int>(new_string_start).swap(string_start);
}
You can call the above periodically to compact memory use after removing a number of strings.
String Sequence
Here's some code throwing all this stuff together that you can freely use and modify however you like.
////////////////////////////////////////////////////////
// StringSequence.hpp:
////////////////////////////////////////////////////////
#ifndef STRING_SEQUENCE_HPP
#define STRING_SEQUENCE_HPP
#include <vector>
/// Stores a sequence of strings.
class StringSequence
{
public:
/// Creates a new sequence of strings.
StringSequence();
/// Inserts a new string to the back of the sequence.
void insert(const char str[]);
/// Inserts a new string to the back of the sequence.
void insert(size_t len, const char str[]);
/// Removes the nth string.
void erase(size_t n);
/// @return The nth string.
const char* operator[](size_t n) const;
/// @return The range of indexable strings.
size_t range() const;
/// @return True if the nth index is occupied by a string.
bool occupied(size_t n) const;
/// Compacts the memory use of the sequence.
void compact();
/// Swaps the contents of this sequence with the other.
void swap(StringSequence& other);
private:
std::vector<char> buffer;
std::vector<size_t> start;
size_t write_pos;
size_t num_removed;
};
#endif
////////////////////////////////////////////////////////
// StringSequence.cpp:
////////////////////////////////////////////////////////
#include "StringSequence.hpp"
#include <cassert>
StringSequence::StringSequence(): write_pos(1), num_removed(0)
{
// Reserve the front of the buffer for empty strings.
// We'll point removed strings here.
buffer.push_back('\0');
}
void StringSequence::insert(const char str[])
{
assert(str && "Trying to insert a null string!");
insert(strlen(str), str);
}
void StringSequence::insert(size_t len, const char str[])
{
const size_t str_size = len + 1;
buffer.insert(buffer.end(), str, str + str_size);
start.push_back(write_pos);
write_pos += str_size;
}
void StringSequence::erase(size_t n)
{
assert(occupied(n) && "The nth string has already been removed!");
start[n] = 0;
++num_removed;
}
const char* StringSequence::operator[](size_t n) const
{
return &buffer[0] + start[n];
}
size_t StringSequence::range() const
{
return start.size();
}
bool StringSequence::occupied(size_t n) const
{
return start[n] != 0;
}
void StringSequence::compact()
{
if (num_removed > 0)
{
// Create a new sequence excluding removed strings.
StringSequence new_seq;
new_seq.buffer.reserve(buffer.size());
new_seq.start.reserve(start.size());
for (size_t j=0; j < range(); ++j)
{
const char* str = (*this)[j];
if (occupied(j))
new_seq.insert(str);
}
// Swap the new sequence with this one.s
new_seq.swap(*this);
}
// Remove excess capacity.
if (buffer.capacity() > buffer.size())
std::vector<char>(buffer).swap(buffer);
if (start.capacity() > start.size())
std::vector<size_t>(start).swap(start);
}
void StringSequence::swap(StringSequence& other)
{
buffer.swap(other.buffer);
start.swap(other.start);
std::swap(write_pos, other.write_pos);
std::swap(num_removed, other.num_removed);
}
////////////////////////////////////////////////////////
// Quick demo:
////////////////////////////////////////////////////////
#include "StringSequence.hpp"
#include <iostream>
using namespace std;
int main()
{
StringSequence seq;
seq.insert("foo");
seq.insert("bar");
seq.insert("baz");
seq.insert("hello");
seq.insert("world");
seq.erase(2);
seq.erase(3);
cout << "Before compaction:" << endl;
for (size_t j=0; j < seq.range(); ++j)
{
if (seq.occupied(j))
cout << j << ": " << seq[j] << endl;
}
cout << endl;
cout << "After compaction:" << endl;
seq.compact();
for (size_t j=0; j < seq.range(); ++j)
{
if (seq.occupied(j))
cout << j << ": " << seq[j] << endl;
}
cout << endl;
}
Output:
Before compaction:
0: foo
1: bar
4: world
After compaction:
0: foo
1: bar
2: world
I didn't bother to make it standard-compliant (too lazy and the result isn't necessarily that much more useful for this particular situation) but hopefully not a strong need here.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With