Deleting large hashmaps with millions of strings on one thread affects performance on another thread

So I have this C++ program that basically parses giant dataset files and loads the contents into a hashmap in memory (this part is throttled in the main thread, so it never takes up a giant chunk of time at once). When that is done I flip a pointer to the new memory location and call delete on the old one. Other than that, the program does incoming request matching by looking up content in those in-memory maps (on the main thread). Suppose those giant maps are wrapped in an Evaluator class:

Evaluator* oldEvaluator = mEvaluator;
Evaluator* newEvaluator = parseDataSet();
mEvaluator = newEvaluator;
delete oldEvaluator;

//And then on request processing:
mEvaluator->lookup(request);

The map can contain millions of string objects as keys. They're regular strings that could be request attributes like IP, User-Agent, etc., but each is a string object inserted into the STL unordered_map.

The dataset is periodically updated, but most of the time the program is just doing request attribute matching against the dataset in memory, and it's fine and efficient with no errors, except when bulk consumption of the new dataset happens. The alternative way of consuming this large dataset is streaming, but that's a relatively longer-term solution.

It used to be a single-threaded program using an event-driven model, but every time a complete new set was placed and destruction was called, it took too long to delete the whole thing, blocking request processing.

So I put the deletion of the map onto a separate thread. The problem is that while the deletion and request processing now appear to happen simultaneously, I can see a very visible, drastic slowdown on the request-processing thread.
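A minimal sketch of that hand-off, assuming the evaluator is held by unique_ptr (the type alias and function name here are made up for illustration):

```cpp
#include <memory>
#include <string>
#include <thread>
#include <unordered_map>

using Evaluator = std::unordered_map<std::string, std::string>;

// Hand the old evaluator to a detached thread so its destructor
// (millions of small string frees) runs off the request-processing path.
void retireEvaluator(std::unique_ptr<Evaluator> old) {
    std::thread([m = std::move(old)]() mutable {
        m.reset();  // all deallocation happens on this background thread
    }).detach();
}
```

This isolates the destructor call, but as discussed below it does not isolate the allocator work it triggers.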

Of course there are other processes running on the host, and I do expect the 2 threads to compete for CPU cycles. But I didn't expect to see a drastic slowdown on the request-matching thread. On average a request should be processed at the 500 µs level, but while the deletion thread was running it got as slow as 5 ms. When the CPU interrupts the matching thread (because it took way too long) it can go as high as 50 ms or 120 ms. In extreme cases a request can take the entire 1000 ms to process, which is about the time the whole data structure deletion takes on the other thread.

What is the best way to find the root cause of such a slowdown? Is it more of a CPU or a memory-bandwidth bottleneck? I imagined that as long as I put it on a separate thread I wouldn't care how slow it goes, since it has to delete the string objects one by one after all, so I didn't expect it to affect the other thread...

EDIT: Thanks, a couple of comments/answers already seem to point out several possible causes:

  1. Memory fragmentation. Less frequently visited strings are stored in more expensive memory locations (so cache misses), or they're stored in an unordered_map with many pointers, or the system is doing memory compaction while deletion punches holes all over the place? But why exactly does this slow down another thread?
  2. One comment mentioned it's heap contention due to thread-safe locking? So the entire heap for this program locks down because one thread is busy deleting, which blocks another thread's access to heap memory? Just to clarify, the program deliberately never allocates stuff and frees other stuff at the same time, and it only has 2 threads, one dedicated to deletion.

So what should I do then? I tried jemalloc, though I'm not sure I used it entirely correctly --- it seems including -ljemalloc on the linker line just magically replaces libc's malloc? I tried it, with no performance difference, but I could be using it wrong. My program doesn't do any explicit malloc; everything is new with sizes unknown in advance, hooked together with pointers and STL maps.

Also, all strings stored as keys are specifically used for quick lookup, so they can't be stored in a vector with an index; even though that would give contiguous memory, it would be horrible to locate them. So,

  1. How can I find out for sure that the above 2 memory issues are the cause (any tools/metrics)?
  2. What can I do to fix it without changing my consumption model to streaming? Assuming the root causes are the above 2, it seems I should do one or both of these: 1) allocate all my STL maps along with their objects from one pool? How do I do that? 2) reduce heap contention (I don't know whether jemalloc solves either of these in my case).
asked Feb 27 '20 by Superziyi


3 Answers

It might be worthwhile to store just a single std::string for all your data combined, and use std::string_view in the map. This eliminates mutex contention as there's only one memory allocation needed. string_view has a trivial destructor so you don't need a thread for that.

I've successfully used this technique before to speed up a program by 2500%, but that was also because this technique reduced the total memory usage.
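A minimal sketch of this idea (C++17; the struct layout and key offsets are illustrative, and assume the backing string is never reallocated after the views are taken):

```cpp
#include <memory>
#include <string>
#include <string_view>
#include <unordered_map>

// One backing string owns all key bytes; the map keys are views into it.
// Destroying the map frees bucket/node memory, but no per-key strings.
struct Evaluator {
    std::string storage;  // all keys concatenated, one allocation
    std::unordered_map<std::string_view, int> index;
};

std::unique_ptr<Evaluator> build() {
    auto ev = std::make_unique<Evaluator>();
    ev->storage = "alicebob";                // offsets known at parse time
    std::string_view all = ev->storage;
    ev->index.emplace(all.substr(0, 5), 1);  // key "alice"
    ev->index.emplace(all.substr(5, 3), 2);  // key "bob"
    return ev;  // views stay valid: the Evaluator object itself is not moved
}
```

Lookups work directly with string_view, so request attributes can be matched without constructing a std::string at all.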

answered Nov 03 '22 by MSalters


You can try using a std::vector for storing the strings. std::vector elements are stored contiguously, so it will reduce cache misses (see What is a "cache-friendly" code?)

So you will have a map<???, size_t> instead of a map<???, std::string>. You will have one more indirection to get your string (which means an extra run-time cost), but it allows you to iterate over all the strings with far fewer cache misses.
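A sketch of that indirection, assuming string values (the Store type and its members are made-up names):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Values live in one vector; the map stores indices instead of strings.
struct Store {
    std::vector<std::string> values;
    std::unordered_map<std::string, std::size_t> byKey;

    void put(std::string key, std::string value) {
        byKey.emplace(std::move(key), values.size());
        values.push_back(std::move(value));
    }
    const std::string& get(const std::string& key) const {
        return values.at(byKey.at(key));  // extra indirection, better locality
    }
};
```

Note that the vector's elements are contiguous, though each std::string may still own a separate heap buffer unless it fits in the small-string optimization.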

answered Nov 02 '22 by Martin Morterol


It would be great if you could recreate the problem you are encountering with an MVCE and show it: you know, many times the problem you think is your problem... is not the problem.

How can I find for sure the above 2 memory issues are the cause (any tools/metrics?)

Given the information here I would suggest using a profiler - gprof (compile with -g -pg) being the basic one. If you have the Intel compiler available, you can use VTune.

There is a free version of VTune, but I have personally used the commercial version only.

Besides this, you can insert timings in your code: from the textual description it's not clear whether the time to populate the map is comparable to the time needed to erase it, or whether it grows consistently when run concurrently. I would start with that. Note that the current version of malloc() is greatly optimized for concurrency too (is this Linux? - please add a tag to the question).

For sure, when you erase the map there are millions of free()s called by the std::string destructors - but you need to be sure whether this is the problem or not: you can use a better approach (many are mentioned in the answers/comments) or a custom allocator backed by a huge memory block that you create/destroy as a single unit.
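One way to sketch the "huge memory block, destroyed as a single unit" idea is C++17's polymorphic memory resources; a minimal, assumed layout:

```cpp
#include <memory>
#include <memory_resource>
#include <string>
#include <unordered_map>

// Every node and key byte comes from one arena; destroying the Evaluator
// releases the arena in a few large frees instead of millions of small ones.
struct Evaluator {
    std::pmr::monotonic_buffer_resource arena{1 << 20};  // grows as needed
    std::pmr::unordered_map<std::pmr::string, int> map{&arena};
    // Note: map is declared after arena, so it is destroyed first.
};
```

Usage is then ordinary map insertion, and the whole structure is freed in one shot when the Evaluator is deleted:

```cpp
auto ev = std::make_unique<Evaluator>();
ev->map.try_emplace("user-agent", 7);
ev.reset();  // arena released wholesale, no per-string free()
```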

If you provide an MVCE as a starting point, I or others will be able to provide a consistent answer (this is not an answer yet - but too long to be a comment).

Just to clarify, the program deliberately never allocates stuff and frees others at the same time, and it only has 2 threads, one dedicated for just deletion.

Keep in mind that each string in the map needs one (or more) new and one delete (based on malloc() and free() respectively), whether the strings are in the keys or in the values.

What do you have in the "values" of the map?

Since you have a map<string, set<int>> you have many allocations: every time you perform a map[string].insert(val) with a new key, your code implicitly calls malloc() for both the string and the set. Even if the key is already in the map, a new int in the set requires a new node in the set to be allocated.

So you have really many allocations while building the structure: your memory is very fragmented on one side, and your code seems really "malloc intensive", which could in principle cause the memory calls to starve.

Multithreaded memory allocations/deallocations

One peculiarity of modern memory subsystems is that they are optimized for multi-core systems: when one thread allocates memory on one core, there is no global lock, but a thread-local or core-local lock for a thread-local pool.

This means that when one thread needs to free the memory allocated by another one, there is a non-local (slower) lock involved.

This means that the best approach is for each thread to allocate/deallocate its own memory. That said, while in principle you can optimize your code a lot with data structures that require fewer malloc/free interactions, your code will be more local with respect to memory allocations if you let each thread:

  • get one block of data
  • build the map<string, set<int>>
  • free it

And you have two threads that repeatedly perform this task.

NOTE: you need enough RAM to handle concurrent evaluators, but you are already using 2 of them concurrently loaded with a double-buffering scheme (one filling, one cleaning). Are you sure your system is not swapping because of RAM exhaustion?

Furthermore, this approach is scalable: you can use as many threads as you want. In your approach you were limited to 2 threads - one building the structure, one destroying it.
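A sketch of that scheme, where each worker both builds and destroys its own map so frees return to the same thread-local pool the allocations came from (worker count, entry count, and the counter are illustrative):

```cpp
#include <atomic>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

std::atomic<long> gProcessed{0};  // just to observe progress in this sketch

// Build and destroy a map entirely on one thread: every free() goes
// back to the same thread-local malloc pool the allocation came from.
void buildAndFree(int entries) {
    std::unordered_map<std::string, int> m;
    for (int i = 0; i < entries; ++i)
        m.emplace(std::to_string(i), i);
    gProcessed += static_cast<long>(m.size());
}   // m destroyed here, on the thread that allocated it

void run(int workers, int entries) {
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w)
        pool.emplace_back(buildAndFree, entries);
    for (auto& t : pool) t.join();
}
```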

Optimizing

Without an MVCE it's hard to give directions. Here are just ideas; only you know whether they can be applied now:

  • replace the set with a sorted vector, reserved at creation time
  • replace the map keys with a flat vector of equally spaced, sorted strings
  • store the string keys sequentially in a flat vector, and add hashes to keep track of the keys of the map. Add a hash map to keep track of the ordering of the strings in the vector.
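For the first bullet, a sorted vector used as a set might look like this (a sketch; the IntSet wrapper is a made-up name):

```cpp
#include <algorithm>
#include <vector>

// A sorted vector replaces set<int>: one contiguous allocation
// instead of one tree node per element.
struct IntSet {
    std::vector<int> v;

    void finishBuild() {  // call once after all inserts
        std::sort(v.begin(), v.end());
        v.erase(std::unique(v.begin(), v.end()), v.end());
    }
    bool contains(int x) const {
        return std::binary_search(v.begin(), v.end(), x);
    }
};
```

Lookups stay O(log n) via binary search, and destruction is a single deallocation.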
answered Nov 03 '22 by Sigi