We set up two identical HP Z840 Workstations with the following specs
and installed Windows 7 SP1 (x64) and Windows 10 Creators Update (x64) on each.
Then we ran a small memory benchmark (code below, built with VS2015 Update 3 for the x64 architecture) which allocates, fills, and frees memory simultaneously from multiple threads.
#include <Windows.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>
#include <ppl.h>

unsigned __int64 ZQueryPerformanceCounter()
{
    unsigned __int64 c;
    ::QueryPerformanceCounter((LARGE_INTEGER *)&c);
    return c;
}

unsigned __int64 ZQueryPerformanceFrequency()
{
    unsigned __int64 c;
    ::QueryPerformanceFrequency((LARGE_INTEGER *)&c);
    return c;
}

// Simple wall-clock timer based on QueryPerformanceCounter.
class CZPerfCounter {
public:
    CZPerfCounter() : m_st(ZQueryPerformanceCounter()) {};
    void reset() { m_st = ZQueryPerformanceCounter(); };
    unsigned __int64 elapsedCount() { return ZQueryPerformanceCounter() - m_st; };
    unsigned long elapsedMS() { return (unsigned long)(elapsedCount() * 1000 / m_freq); };
    unsigned long elapsedMicroSec() { return (unsigned long)(elapsedCount() * 1000 * 1000 / m_freq); };
    static unsigned __int64 frequency() { return m_freq; };
private:
    unsigned __int64 m_st;
    static unsigned __int64 m_freq;
};

unsigned __int64 CZPerfCounter::m_freq = ZQueryPerformanceFrequency();

int main(int argc, char ** argv)
{
    SYSTEM_INFO sysinfo;
    GetSystemInfo(&sysinfo);
    int ncpu = sysinfo.dwNumberOfProcessors;

    if (argc == 2) {
        ncpu = atoi(argv[1]);
    }

    {
        printf("No of threads %d\n", ncpu);

        // Limit the ConcRT scheduler to exactly ncpu threads.
        try {
            concurrency::Scheduler::ResetDefaultSchedulerPolicy();
            int min_threads = 1;
            int max_threads = ncpu;
            concurrency::SchedulerPolicy policy
                (2 // two entries of policy settings
                    , concurrency::MinConcurrency, min_threads
                    , concurrency::MaxConcurrency, max_threads
                );
            concurrency::Scheduler::SetDefaultSchedulerPolicy(policy);
        }
        catch (concurrency::default_scheduler_exists &) {
            printf("Cannot set concurrency runtime scheduler policy (Default scheduler already exists).\n");
        }

        static int cnt = 100;
        static int num_fills = 1;
        CZPerfCounter pcTotal;

        // malloc/free
        printf("malloc/free\n");
        {
            CZPerfCounter pc;
            for (int i = 1 * 1024 * 1024; i <= 8 * 1024 * 1024; i *= 2) {
                // Each parallel_for iteration allocates cnt blocks of i bytes,
                // fills them num_fills times, then frees them.
                concurrency::parallel_for(0, 50, [i](size_t x) {
                    std::vector<void *> ptrs;
                    ptrs.reserve(cnt);
                    for (int n = 0; n < cnt; n++) {
                        auto p = malloc(i);
                        ptrs.emplace_back(p);
                    }
                    for (int f = 0; f < num_fills; f++) {
                        for (auto p : ptrs) {
                            memset(p, num_fills, i);
                        }
                    }
                    for (auto p : ptrs) {
                        free(p);
                    }
                });
                printf("size %4d MB, elapsed %8.2f s, \n", i / (1024 * 1024), pc.elapsedMS() / 1000.0);
                pc.reset();
            }
        }
        printf("\n");

        printf("Total %6.2f s\n", pcTotal.elapsedMS() / 1000.0);
    }

    return 0;
}
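For scale (derived from the constants above): each size step runs 50 parallel_for iterations that each allocate, fill, and free cnt = 100 blocks of i bytes, i.e. 5000 blocks per step, so the 8 MB step alone commits, touches, and releases roughly 5000 × 8 MB ≈ 39 GB.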
Surprisingly, the results are much worse on Windows 10 CU than on Windows 7. I plotted the results below for 1 MB and 8 MB chunk sizes, varying the number of threads from 2, 4, ..., up to 28. While Windows 7 performed slightly worse as we increased the number of threads, Windows 10 showed far worse scalability.
We have tried making sure all Windows updates are applied, updating drivers, and tweaking BIOS settings, without success. We also ran the same benchmark on several other hardware platforms, and all showed a similar curve for Windows 10. So it seems to be a problem with Windows 10.
Does anyone have similar experience, or know what might be going on here (maybe we missed something)? This behavior has caused a significant performance hit for our multithreaded application.
*** EDITED
Using https://github.com/google/UIforETW (thanks to Bruce Dawson) to analyze the benchmark, we found that most of the time is spent inside the kernel's KiPageFault. Digging further down the call tree, everything leads to ExpWaitForSpinLockExclusiveAndAcquire. It seems that lock contention is causing this issue.
*** EDITED
Collected Server 2012 R2 data on the same hardware. Server 2012 R2 is also worse than Win7, but still a lot better than Win10 CU.
*** EDITED
It happens in Server 2016 as well. I added the tag windows-server-2016.
*** EDITED
Using info from @Ext3h, I modified the benchmark to use VirtualAlloc and VirtualLock. I can confirm a significant improvement compared to when VirtualLock is not used. Overall, Win10 is still 30% to 40% slower than Win7 when both use VirtualAlloc and VirtualLock.
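For reference, a minimal sketch of what the modified per-thread loop might look like (an assumption of the change, not the exact code we ran): VirtualLock/VirtualUnlock forces the freshly committed pages to be faulted in up front, so the later memset does not take a soft page fault per page.

// Hypothetical VirtualAlloc/VirtualLock variant of the per-thread loop.
for (int n = 0; n < cnt; n++) {
    void *p = VirtualAlloc(NULL, i, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (p) {
        VirtualLock(p, i);      // pre-fault the pages
        VirtualUnlock(p, i);    // release the lock so the working-set quota is not exhausted
    }
    ptrs.emplace_back(p);
}
for (int f = 0; f < num_fills; f++) {
    for (auto p : ptrs) {
        memset(p, num_fills, i);
    }
}
for (auto p : ptrs) {
    VirtualFree(p, 0, MEM_RELEASE);
}

Note that VirtualLock can fail for large totals unless the minimum working-set size is raised first (e.g. via SetProcessWorkingSetSize, as in the answer's code below).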
Microsoft seems to have fixed this issue with Windows 10 Fall Creators Update and Windows 10 Pro for Workstation.
Here is the updated graph.
Win10 FCU and WKS have lower overhead than Win7. In exchange, VirtualLock seems to have higher overhead.
Unfortunately not an answer, just some additional insight.
A little experiment with a different allocation strategy:
#include <Windows.h>
#include <thread>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <functional>
#include <vector>
#include <atomic>
#include <cstring>
#include <iostream>
#include <chrono>

class AllocTest
{
public:
    virtual void* Alloc(size_t size) = 0;
    virtual void Free(void* allocation) = 0;
};

// Allocates and frees directly on the calling thread.
class BasicAlloc : public AllocTest
{
public:
    void* Alloc(size_t size) override {
        return VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    }
    void Free(void* allocation) override {
        VirtualFree(allocation, 0, MEM_RELEASE);
    }
};

// Delegates all allocation and free work to a single dedicated thread.
class ThreadAlloc : public AllocTest
{
public:
    ThreadAlloc() {
        t = std::thread([this]() {
            std::unique_lock<std::mutex> qlock(this->qm);
            do {
                this->qcv.wait(qlock, [this]() {
                    return shutdown || !q.empty();
                });
                {
                    std::unique_lock<std::mutex> rlock(this->rm);
                    while (!q.empty())
                    {
                        q.front()();
                        q.pop();
                    }
                }
                rcv.notify_all();
            } while (!shutdown);
        });
    }
    ~ThreadAlloc() {
        {
            std::unique_lock<std::mutex> lock1(this->rm);
            std::unique_lock<std::mutex> lock2(this->qm);
            shutdown = true;
        }
        qcv.notify_all();
        rcv.notify_all();
        t.join();
    }
    void* Alloc(size_t size) override {
        void* target = nullptr;
        {
            std::unique_lock<std::mutex> lock(this->qm);
            q.emplace([this, &target, size]() {
                target = VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
                // Touch the pages on the allocation thread so the caller does
                // not take the soft page faults on first access.
                VirtualLock(target, size);
                VirtualUnlock(target, size);
            });
        }
        qcv.notify_one();
        {
            std::unique_lock<std::mutex> lock(this->rm);
            rcv.wait(lock, [&target]() {
                return target != nullptr;
            });
        }
        return target;
    }
    void Free(void* allocation) override {
        {
            std::unique_lock<std::mutex> lock(this->qm);
            q.emplace([allocation]() {
                VirtualFree(allocation, 0, MEM_RELEASE);
            });
        }
        qcv.notify_one();
    }
private:
    std::queue<std::function<void()>> q;
    std::condition_variable qcv;
    std::condition_variable rcv;
    std::mutex qm;
    std::mutex rm;
    std::thread t;
    std::atomic_bool shutdown{ false };
};

int main()
{
    // Raise the working-set limits so VirtualLock can succeed on large buffers.
    SetProcessWorkingSetSize(GetCurrentProcess(), size_t(4) * 1024 * 1024 * 1024, size_t(16) * 1024 * 1024 * 1024);

    BasicAlloc alloc1;
    ThreadAlloc alloc2;
    AllocTest *allocator = &alloc2;

    const size_t buffer_size = 1 * 1024 * 1024;
    const size_t buffer_count = 10 * 1024;
    const unsigned int thread_count = 32;

    std::vector<void*> buffers;
    buffers.resize(buffer_count);
    std::vector<std::thread> threads;
    threads.resize(thread_count);

    void* reference = allocator->Alloc(buffer_size);
    std::memset(reference, 0xaa, buffer_size);

    // Each thread allocates a buffer, copies the reference into it (first
    // touch of the new pages), then frees it again.
    auto func = [&buffers, allocator, buffer_size, buffer_count, reference, thread_count](int thread_id) {
        for (int i = thread_id; i < buffer_count; i += thread_count) {
            buffers[i] = allocator->Alloc(buffer_size);
            std::memcpy(buffers[i], reference, buffer_size);
            allocator->Free(buffers[i]);
        }
    };

    for (int i = 0; i < 10; i++)
    {
        std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
        for (int t = 0; t < thread_count; t++) {
            threads[t] = std::thread(func, t);
        }
        for (int t = 0; t < thread_count; t++) {
            threads[t].join();
        }
        std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
        std::cout << duration << std::endl;
    }

    DebugBreak();
    return 0;
}
Under all sane conditions, BasicAlloc is faster, just as it should be. In fact, on a quad-core CPU (no HT), there is no configuration in which ThreadAlloc could outperform it. ThreadAlloc is consistently around 30% slower. (Which is actually surprisingly little, and it holds true even for tiny 1 kB allocations!)
However, if the CPU has around 8-12 virtual cores, it eventually reaches the point where BasicAlloc actually scales negatively, while ThreadAlloc just "stalls" on the baseline overhead of soft faults.
If you profile the two different allocation strategies, you can see that for a low thread count, KiPageFault shifts from memcpy on BasicAlloc to VirtualLock on ThreadAlloc.
For higher thread and core counts, ExpWaitForSpinLockExclusiveAndAcquire eventually grows from virtually zero load to as much as 50% with BasicAlloc, while ThreadAlloc only maintains the constant overhead from KiPageFault itself.
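To make the mechanism concrete: a freshly committed region consists of demand-zero pages, so the first write to each page traps into KiPageFault. Touching the pages on the dedicated thread, whether via VirtualLock as above or with an explicit per-page write as in this hypothetical sketch, moves those faults off the worker threads:

// Hypothetical alternative to VirtualLock: touch each page once so the soft
// faults are taken here instead of in the consumer's memcpy. A 4 KiB page
// size is assumed; GetSystemInfo reports the real value.
static void PreFault(void* p, size_t size) {
    volatile char* bytes = static_cast<volatile char*>(p);
    const size_t page = 4096;
    for (size_t off = 0; off < size; off += page) {
        bytes[off] = 0;    // each write faults in exactly one page
    }
}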
Well, the stall with ThreadAlloc is also pretty bad. No matter how many cores or NUMA nodes you have, you are currently hard capped at around 5-8 GB/s in new allocations, across all processes in the system, limited solely by single-thread performance. All the dedicated memory-management thread achieves is not wasting CPU cycles on a contended critical section.
You would have expected Microsoft to have a lock-free strategy for assigning pages on different cores, but apparently that's not even remotely the case.
The spin-lock was also already present in the Windows 7 and earlier implementations of KiPageFault. So what did change?
Simple answer: KiPageFault itself became much slower. No clue what exactly caused the slowdown, but the spin-lock never became an obvious limit before, because 100% contention was never possible.
If someone wishes to disassemble KiPageFault to find the most expensive part - be my guest.