I have a C++ program that parses giant dataset files and loads their contents into an in-memory hashmap (this part is throttled on the main thread, so it never takes up a big chunk of time at once). When loading is done I flip a pointer to the new memory location and call delete on the old one. Apart from that, the program matches incoming requests by looking up their attributes in those in-memory maps (on the main thread). Suppose those giant maps are wrapped in an Evaluator class:
Evaluator* oldEvaluator = mEvaluator;
Evaluator* newEvaluator = parseDataSet();
mEvaluator = newEvaluator;
delete oldEvaluator;
//And then on request processing:
mEvaluator->lookup(request)
The map can contain millions of string objects as keys. They're regular strings holding request attributes such as IP, User-Agent, etc., but each one is a separate string object inserted into an STL unordered_map.
The dataset is periodically updated, but most of the time the program is just matching request attributes against the dataset in memory, and that part is fast, efficient, and error-free. The problem appears only when bulk consumption of a new dataset happens. An alternative way to consume this large dataset would be streaming, but that's a longer-term solution.
It used to be a single-threaded program with an event-driven model, but every time a complete new set was swapped in and the old one destroyed, deleting the whole thing took too long and blocked request processing.
So I moved the deletion of the old map onto a separate thread. The problem is that, while deletion and request processing now appear to happen simultaneously, I see a very visible, drastic slowdown on the request-processing thread.
Of course there are other processes running on the host, and I do expect the two threads to compete for CPU cycles, but I didn't expect such a drastic slowdown on the request-matching thread. On average a request is processed at the 500 µs level, but while the deletion thread is running it can be as slow as 5 ms. When the CPU interrupts the matching thread (because it took far too long) the time can stretch to 50 ms, 120 ms, and so on. In extreme cases a request can take a full 1000 ms to process, which is roughly how long the whole data-structure deletion takes on the other thread.
What is the best way to find the root cause of such a slowdown? Is it more of a CPU or a memory-bandwidth bottleneck? I imagined that as long as deletion ran on a separate thread I wouldn't care how slow it went, since it has to delete the string objects one by one anyway, so I didn't expect it to affect the other thread...
EDIT: Thanks! A couple of comments/answers already seem to point out several possible causes:
So what should I do then? I tried jemalloc, though I'm not sure I used it correctly: it seems that adding -ljemalloc to the linker line just magically replaces libc's malloc? I tried it and saw no performance difference, but I could be using it wrong. My program doesn't do any explicit malloc; everything is new with sizes unknown in advance, hooked together with pointers and STL maps.
Also, all the strings stored as keys are used specifically for quick lookup, so they can't be stored in a vector with an index; even though that would give contiguous memory, locating them would be horrible. (So I'm not sure jemalloc solves either issue in my case.)
It's also worth bearing in mind that threads have their own kind of context switching. Increasing the number of threads doesn't increase performance capacity (as you pointed out), and it also drains CPU time by giving the kernel more work to do.
It might be worthwhile to store just a single std::string for all your data combined, and use std::string_view in the map. This eliminates mutex contention, as there's only one memory allocation needed. std::string_view has a trivial destructor, so you don't need a thread for that.
I've successfully used this technique before to speed up a program by 2500%, but that was also because this technique reduced the total memory usage.
You can try using a std::vector for storing the memory. std::vector elements are stored contiguously, so it will reduce cache misses (see "What is a 'cache-friendly' code?").
So you will have a map<???, size_t> instead of a map<???, std::string>. You will have one more indirection to get your string (which means an extra run-time cost), but it allows you to iterate over all the strings with far fewer cache misses.
It would be great if you could reproduce the problem you are encountering with an MCVE and show it: as so often, the problem you think you have... is not the actual problem.
How can I find out for sure whether the above two memory issues are the cause (any tools/metrics)?
Given the information here, I would suggest using a profiler; gprof (compile with -g -pg) is the basic one. If you have the Intel compiler available, you can use VTune. There is a free version of VTune, but I have personally used only the commercial version.
Besides this, you can insert timings into your code: from the textual description, it's not clear whether the time to populate the map is comparable to the time needed to erase it, or whether it grows considerably when they run concurrently. I would start with that. Note that the current version of malloc() is greatly optimized for concurrency too (is this Linux? Please add a tag to the question).
For sure, when you erase the map there are millions of free() calls issued by std::string::~string(), but you need to confirm whether or not this is the problem. You can use a better approach (many are mentioned in the answers/comments) or a custom allocator backed by a huge memory block that you create/destroy as a single unit.
If you provide an MCVE as a starting point, I or others will be able to provide a consistent answer (this is not an answer yet, but too long for a comment).
Just to clarify: the program deliberately never allocates some objects and frees others at the same time, and it has only two threads, one dedicated solely to deletion.
Keep in mind that each string in the map needs one (or more) new and one delete (based on malloc() and free() respectively), whether the strings are in the keys or in the values.
Since you have a map<string, set<int>>, you have many allocations: every time you perform a map[string].insert(val) with a new key, your code implicitly calls malloc() for both the string and the set. Even if the key is already in the map, a new int in the set requires a new node of the set to be allocated.
So you have really many allocations while building the structure: your memory is very fragmented on one side, and your code is really "malloc intensive" on the other, which could in principle starve the memory-allocation calls.
One peculiarity of modern memory subsystems is that they are optimized for multi-core systems: when one thread allocates memory on one core, there is no global lock, but a thread-local or core-local lock for a thread-local pool. This means that when one thread frees memory that was allocated by another, a non-local (slower) lock is involved.
This means that the best approach is for each thread to allocate and deallocate its own memory. In principle you can optimize your code a lot with data structures that require fewer malloc/free interactions, but your code will be more local, with respect to memory allocations, if you let each thread build, use, and then destroy its own map<string, set<int>>, and you have two threads that repeatedly perform this task.
NOTE: you need enough RAM to handle concurrent evaluators, but you are already using two of them concurrently in a double-buffering scheme (one filling, one being cleaned). Are you sure your system is not swapping because of RAM exhaustion?
Furthermore, this approach is scalable: you can use as many threads as you want. In your approach you were limited to two threads: one building the structure, one destroying it.
Without an MCVE it's hard to give directions. These are just ideas; only you can know whether they apply to your case right now.