Performance decreases with a higher number of threads (no synchronization)

I have a data structure (a vector) whose elements have to be parsed by a function; the elements can be parsed by different threads.

Following is the parsing method:

void ConsumerPool::parse(size_t n_threads, size_t id)
{
    for (size_t idx = id; idx < nodes.size(); idx += n_threads)
    {
        // parse node
        //parse(nodes[idx]);
        parse(idx);
    }
}

Where:

  • n_threads is the total number of threads
  • id is the (unique) index of the current thread

and the threads are created as follows:

std::vector<std::thread> threads;

for (size_t i = 0; i < n_threads; i++)
    threads.emplace_back(&ConsumerPool::parse, this, n_threads, i);

Unfortunately, although this method works, the performance of my application decreases if the number of threads is too high. I would like to understand why the performance decreases even though there is no synchronization between these threads.

Following are the elapsed times (between the threads start and the last join() return) according to the number of threads used:

  • 2 threads: 500 ms
  • 3 threads: 385 ms
  • 4 threads: 360 ms
  • 5 threads: 475 ms
  • 6 threads: 580 ms
  • 7 threads: 635 ms
  • 8 threads: 660 ms
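
For reference, the times above are measured around the full launch-and-join sequence, roughly like the following sketch (the actual harness is not shown here; the std::chrono usage is an assumption):

// Sketch of the measurement (assumes <chrono>, <thread>, <vector>):
auto start = std::chrono::steady_clock::now();

std::vector<std::thread> threads;
for (size_t i = 0; i < n_threads; i++)
    threads.emplace_back(&ConsumerPool::parse, this, n_threads, i);

for (auto& t : threads)
    t.join();

auto elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
    std::chrono::steady_clock::now() - start).count();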

The time necessary for thread creation is always between 1 and 2 ms. The software has been tested using its release build. Following is my configuration:

2x Intel(R) Xeon(R) CPU E5507 @ 2.27GHz

Maximum speed:  2.26 GHz
Sockets:    2
Cores:  8
Logical processors: 8
Virtualization: Enabled
L1 cache:   512 KB
L2 cache:   2.0 MB
L3 cache:   8.0 MB

EDIT:

What the parse() function does is the following:

// data shared between threads (around 300k elements)
std::vector<std::unique_ptr<Foo>> vfoo;
std::vector<rapidxml::xml_node<>*> nodes;
std::vector<std::string> layers;

void parse(int idx)
{
    auto& p = vfoo[idx]; // a reference: std::unique_ptr is not copyable

    // p->parse() allocates memory according to the content of the XML node
    if (!p->parse(nodes[idx], layers))
        vfoo[idx].reset();
}
— asked by Nick, Aug 08 '16


2 Answers

The processor you are using, the Intel(R) Xeon(R) CPU E5507, has only 4 cores (see http://ark.intel.com/products/37100/Intel-Xeon-Processor-E5507-4M-Cache-2_26-GHz-4_80-GTs-Intel-QPI). So having more than 4 threads would cause the slowdown because of context switching, as is visible from the data you have provided.

You can read more about context switching at the following link: https://en.wikipedia.org/wiki/Context_switch

— answered by Nemanja Trifunovic


update:

We still don't have a lot of info about the memory access patterns of parse(), and how much time it spends reading input data from memory vs. how much time it spends writing/reading private scratch memory.

You say p->parse() "allocates memory according to the content of the XML node". If it frees it again, you may see a big speedup from keeping a big-enough scratch buffer allocated in each thread. Memory allocation/deallocation is a "global" thing that requires synchronization between threads. A thread-aware allocator can hopefully handle an allocate/free / allocate/free pattern by satisfying allocations from memory just freed by that thread, so it's probably still hot in private L1 or L2 cache on that core.
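
As a sketch of that idea (hypothetical names; the real parse() signature and its allocation pattern aren't shown in the question), each thread could keep one grow-only buffer and reuse it for every node:

#include <cstddef>
#include <vector>

// Hypothetical sketch: one reusable scratch buffer per thread, so
// per-node temporaries don't go through the global allocator.
struct Scratch {
    std::vector<char> buf;

    char* ensure(std::size_t n) {
        if (buf.size() < n)
            buf.resize(n);   // grows rarely; the memory stays hot in L1/L2
        return buf.data();
    }
};

// parse_with_scratch() is a hypothetical variant of parse() that
// writes its temporaries into the caller's buffer instead of the heap.
void parse_with_scratch(std::size_t idx, Scratch& scratch);

void parse_range(std::size_t begin, std::size_t end)
{
    Scratch scratch;                        // lives for the whole thread
    for (std::size_t idx = begin; idx < end; ++idx)
        parse_with_scratch(idx, scratch);
}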

Use some kind of profiling to find the real hotspots. It might be memory allocation/deallocation, or it might be code that reads some memory.


Your dual-socket Nehalem Xeon doesn't have hyperthreading, so you can't be running into the issue of a non-HT-aware OS scheduling two threads onto logical cores of the same physical core and having them slow each other down.


You should investigate with performance counters (e.g. Linux perf stat, or Intel's VTune) whether you're getting more cache misses per thread once you pass 4 threads. Nehalem uses large shared (for the whole socket) L3 (aka last-level) caches, so more threads running on the same socket creates more pressure on that. The relevant perf events will be something like LLC_something, IIRC.

You should definitely look at L1/L2 misses, see how those scale with the number of threads, and how that changes with strided vs. contiguous access to nodes[].

There are other perf counters you can check to look for false sharing (one thread's private variable sharing a cache line with another thread's private variable, so the cache line bounces between cores). Really just look for any perf events that change with number of threads; that could point the way towards an explanation.
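
To illustrate false sharing (a standalone sketch, not taken from the question's code): per-thread counters packed into one array share cache lines unless you pad them out.

#include <cstddef>

// Unpadded, eight size_t counters pack into one 64-byte cache line, so
// every increment by one thread invalidates that line for the others.
// alignas(64) gives each counter its own line.
struct PaddedCounter {
    alignas(64) std::size_t value = 0;
};

PaddedCounter counters[8];  // counters[i] is private to thread i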


A multi-socket system like your 2-socket Nehalem will have NUMA (Non-uniform_memory_access). A NUMA-aware OS will try to allocate memory that's fast for the core doing the allocation.

So presumably your buffer has all of its physical pages in memory attached to one of your two sockets. In this case it's probably not something you can or should avoid, since I assume you're filling the array in a single-threaded way before handing it off to multiple threads for parsing. In general, though, try to allocate memory (especially scratch buffers) in the thread that will use it most, when that's convenient.
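
A sketch of that idea, assuming a first-touch page placement policy (common on Linux): allocate and first write the scratch memory inside the thread that will use it.

#include <cstddef>
#include <vector>

void worker(std::size_t begin, std::size_t end)
{
    // Allocating and zero-filling here makes the first write happen on
    // this thread, so under a first-touch policy the pages land on this
    // thread's local NUMA node.
    std::vector<char> scratch(1 << 20, 0);

    // ... parse nodes[begin, end) using scratch ...
}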

This NUMA effect may partially explain less-than-perfect scaling with the number of threads, although more likely it has nothing to do with it if @AntonMalyshev's answer didn't help. Having each thread work on a contiguous range, instead of striding through the array with a stride of n_threads, should be better for L2 / L1 cache efficiency; see the sketch below.
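
For example, a contiguous split could look like this (a sketch reusing the question's ConsumerPool::parse signature; the chunk arithmetic is an assumption):

#include <algorithm>  // std::min

void ConsumerPool::parse(size_t n_threads, size_t id)
{
    // Give each thread one contiguous chunk instead of a stride of n_threads.
    size_t chunk = (nodes.size() + n_threads - 1) / n_threads;
    size_t begin = id * chunk;
    size_t end   = std::min(begin + chunk, nodes.size());

    for (size_t idx = begin; idx < end; ++idx)
        parse(idx);
}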

nodes[] is a vector of pointers (so with 8 threads, each thread uses only 8 bytes of each 64-byte cache line it touches in nodes[]). However, each thread presumably touches far more memory in the pointed-to data structures and strings. If node entries point to monotonically increasing positions in other data structures and in the strings, then the strided access to nodes[] creates non-contiguous access patterns over most of the memory touched by the thread.


One possible benefit of the strided access pattern: Strided means that if all threads run at more or less the same speed, they're all looking at the same part of memory at the same time. Threads that get ahead will slow down from L3 misses, while other threads catch up because they see L3 hits. (Unless something happens that lets one thread get too far behind, like the OS de-scheduling it for a time slice.)

So maybe L3 vs. RAM bandwidth / latency is more of an issue than efficient use of per-core L2/L1. Maybe with more threads, L3 bandwidth can't keep up with all the requests for the same cache lines from the L2 caches of multiple cores. (L3 isn't fast enough to satisfy constant L2 misses from all cores at once, even if they all hit in L3.)

This argument applies to everything pointed to by nodes[] only if contiguous ranges of nodes[] point to contiguous ranges of other memory.

— answered by Peter Cordes