Windows 10 poor performance compared to Windows 7 (page fault handling is not scalable, severe lock contention when number of threads > 16)

Tags:

memory-management

windows

windows-7

windows-10

windows-server-2016

We set up two identical HP Z840 Workstations with the following specs

2 x Xeon E5-2690 v4 @ 2.60GHz (Turbo Boost ON, HT OFF, total 28 logical CPUs)
32GB DDR4 2400 Memory, Quad-channel

and installed Windows 7 SP1 (x64) and Windows 10 Creators Update (x64) on each.

Then we ran a small memory benchmark (code below, built with VS2015 Update 3, 64-bit architecture) which performs memory allocation-fill-free simultaneously from multiple threads.

#include <Windows.h>
#include <cstdio>   // printf
#include <cstdlib>  // atoi, malloc, free
#include <cstring>  // memset
#include <vector>
#include <ppl.h>

unsigned __int64 ZQueryPerformanceCounter()
{
    unsigned __int64 c;
    ::QueryPerformanceCounter((LARGE_INTEGER *)&c);
    return c;
}

unsigned __int64 ZQueryPerformanceFrequency()
{
    unsigned __int64 c;
    ::QueryPerformanceFrequency((LARGE_INTEGER *)&c);
    return c;
}

class CZPerfCounter {
public:
    CZPerfCounter() : m_st(ZQueryPerformanceCounter()) {};
    void reset() { m_st = ZQueryPerformanceCounter(); };
    unsigned __int64 elapsedCount() { return ZQueryPerformanceCounter() - m_st; };
    unsigned long elapsedMS() { return (unsigned long)(elapsedCount() * 1000 / m_freq); };
    unsigned long elapsedMicroSec() { return (unsigned long)(elapsedCount() * 1000 * 1000 / m_freq); };
    static unsigned __int64 frequency() { return m_freq; };
private:
    unsigned __int64 m_st;
    static unsigned __int64 m_freq;
};

unsigned __int64 CZPerfCounter::m_freq = ZQueryPerformanceFrequency();



int main(int argc, char ** argv)
{
    SYSTEM_INFO sysinfo;
    GetSystemInfo(&sysinfo);
    int ncpu = sysinfo.dwNumberOfProcessors;

    if (argc == 2) {
        ncpu = atoi(argv[1]);
    }

    {
        printf("No of threads %d\n", ncpu);

        try {
            concurrency::Scheduler::ResetDefaultSchedulerPolicy();
            int min_threads = 1;
            int max_threads = ncpu;
            concurrency::SchedulerPolicy policy
            (2 // two entries of policy settings
                , concurrency::MinConcurrency, min_threads
                , concurrency::MaxConcurrency, max_threads
            );
            concurrency::Scheduler::SetDefaultSchedulerPolicy(policy);
        }
        catch (concurrency::default_scheduler_exists &) {
            printf("Cannot set concurrency runtime scheduler policy (Default scheduler already exists).\n");
        }

        static int cnt = 100;
        static int num_fills = 1;
        CZPerfCounter pcTotal;

        // malloc/free
        printf("malloc/free\n");
        {
            CZPerfCounter pc;
            for (int i = 1 * 1024 * 1024; i <= 8 * 1024 * 1024; i *= 2) {
                concurrency::parallel_for(0, 50, [i](size_t x) {
                    std::vector<void *> ptrs;
                    ptrs.reserve(cnt);
                    for (int n = 0; n < cnt; n++) {
                        auto p = malloc(i);
                        ptrs.emplace_back(p);
                    }
                    for (int f = 0; f < num_fills; f++) { // renamed so it no longer shadows the lambda parameter x
                        for (auto p : ptrs) {
                            memset(p, num_fills, i);
                        }
                    }
                    for (auto p : ptrs) {
                        free(p);
                    }
                });
                printf("size %4d MB,  elapsed %8.2f s, \n", i / (1024 * 1024), pc.elapsedMS() / 1000.0);
                pc.reset();
            }
        }
        printf("\n");
        printf("Total %6.2f s\n", pcTotal.elapsedMS() / 1000.0);
    }

    return 0;
}

Surprisingly, the results are much worse on Windows 10 CU than on Windows 7. I plotted the results below for 1 MB and 8 MB chunk sizes, varying the number of threads from 2, 4, ... up to 28. While Windows 7 performance degraded only slightly as we increased the number of threads, Windows 10 scaled much worse.

[Image: Windows 10 memory access is not scalable (https://i.stack.imgur.com/dLWUI.png)]

We made sure all Windows updates were applied, updated the drivers, and tweaked BIOS settings, without success. We also ran the same benchmark on several other hardware platforms, and all of them gave a similar curve for Windows 10. So it seems to be a problem in Windows 10 itself.

Does anyone have similar experience, or know what is going on here (maybe we missed something)? This behavior has caused a significant performance hit in our multithreaded application.

*** EDITED

Using https://github.com/google/UIforETW (thanks to Bruce Dawson) to analyze the benchmark, we found that most of the time is spent inside the kernel's KiPageFault. Digging further down the call tree, everything leads to ExpWaitForSpinLockExclusiveAndAcquire. It seems that lock contention is causing this issue.

[Image: https://i.stack.imgur.com/Cj7sQ.png]

*** EDITED

Collected Server 2012 R2 data on the same hardware. Server 2012 R2 is also worse than Win7, but still a lot better than Win10 CU.

[Image: https://i.stack.imgur.com/19PRh.png]

*** EDITED

It happens in Server 2016 as well. I added the tag windows-server-2016.

*** EDITED

Using info from @Ext3h, I modified the benchmark to use VirtualAlloc and VirtualLock. I can confirm a significant improvement compared to when VirtualLock is not used. Overall, Win10 is still 30% to 40% slower than Win7 when both use VirtualAlloc and VirtualLock.

[Image: https://i.stack.imgur.com/lLE2E.png]

258

asked Jul 11 '17 01:07

nikoniko

2 Answers

Microsoft seems to have fixed this issue with Windows 10 Fall Creators Update and Windows 10 Pro for Workstation.

Here is the updated graph.

[Image: https://i.stack.imgur.com/kbcF5.png]

Win 10 FCU and WKS have lower overhead than Win 7. In exchange, VirtualLock seems to have higher overhead.

103

answered Sep 26 '22 01:09

nikoniko

Unfortunately not an answer, just some additional insight.

Little experiment with a different allocation strategy:

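#include <Windows.h>

#include <thread>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>      // std::vector
#include <functional>  // std::function
#include <cstring>     // std::memset, std::memcpy
#include <atomic>
#include <iostream>
#include <chrono>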

class AllocTest
{
public:
    virtual void* Alloc(size_t size) = 0;
    virtual void Free(void* allocation) = 0;
};

class BasicAlloc : public AllocTest
{
public:
    void* Alloc(size_t size) override {
        return VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    }
    void Free(void* allocation) override {
        VirtualFree(allocation, 0, MEM_RELEASE); // dwSize is a SIZE_T and must be 0 with MEM_RELEASE
    }
};

class ThreadAlloc : public AllocTest
{
public:
    ThreadAlloc() {
        t = std::thread([this]() {
            std::unique_lock<std::mutex> qlock(this->qm);
            do {
                this->qcv.wait(qlock, [this]() {
                    return shutdown || !q.empty();
                });
                {
                    std::unique_lock<std::mutex> rlock(this->rm);
                    while (!q.empty())
                    {
                        q.front()();
                        q.pop();
                    }
                }
                rcv.notify_all();
            } while (!shutdown);
        });
    }
    ~ThreadAlloc() {
        {
            std::unique_lock<std::mutex> lock1(this->rm);
            std::unique_lock<std::mutex> lock2(this->qm);
            shutdown = true;
        }
        qcv.notify_all();
        rcv.notify_all();
        t.join();
    }
    void* Alloc(size_t size) override {
        void* target = nullptr;
        {
            std::unique_lock<std::mutex> lock(this->qm);
            q.emplace([this, &target, size]() {
                target = VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
                VirtualLock(target, size);
                VirtualUnlock(target, size);
            });
        }
        qcv.notify_one();
        {
            std::unique_lock<std::mutex> lock(this->rm);
            rcv.wait(lock, [&target]() {
                return target != nullptr;
            });
        }
        return target;
    }
    void Free(void* allocation) override {
        {
            std::unique_lock<std::mutex> lock(this->qm);
            q.emplace([allocation]() {
                VirtualFree(allocation, 0, MEM_RELEASE); // dwSize is a SIZE_T and must be 0 with MEM_RELEASE
            });
        }
        qcv.notify_one();
    }
private:
    std::queue<std::function<void()>> q;
    std::condition_variable qcv;
    std::condition_variable rcv;
    std::mutex qm;
    std::mutex rm;
    std::thread t;
    std::atomic_bool shutdown{false};
};

int main()
{
    SetProcessWorkingSetSize(GetCurrentProcess(), size_t(4) * 1024 * 1024 * 1024, size_t(16) * 1024 * 1024 * 1024);

    BasicAlloc alloc1;
    ThreadAlloc alloc2;

    AllocTest *allocator = &alloc2;
    const size_t buffer_size =1*1024*1024;
    const size_t buffer_count = 10*1024;
    const unsigned int thread_count = 32;

    std::vector<void*> buffers;
    buffers.resize(buffer_count);
    std::vector<std::thread> threads;
    threads.resize(thread_count);
    void* reference = allocator->Alloc(buffer_size);

    std::memset(reference, 0xaa, buffer_size);

    auto func = [&buffers, allocator, buffer_size, buffer_count, reference, thread_count](int thread_id) {
        for (int i = thread_id; i < buffer_count; i+= thread_count) {
            buffers[i] = allocator->Alloc(buffer_size);
            std::memcpy(buffers[i], reference, buffer_size);
            allocator->Free(buffers[i]);
        }
    };

    for (int i = 0; i < 10; i++)
    {
        std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
        for (int t = 0; t < thread_count; t++) {
            threads[t] = std::thread(func, t);
        }
        for (int t = 0; t < thread_count; t++) {
            threads[t].join();
        }
        std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();

        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
        std::cout << duration << std::endl;
    }


    DebugBreak();
    return 0;
}

Under all sane conditions, BasicAlloc is faster, just as it should be. In fact, on a quad core CPU (no HT), there is no configuration in which ThreadAlloc could outperform it. ThreadAlloc is consistently around 30% slower. (Which is actually surprisingly little, and it holds true even for tiny 1kB allocations!)

However, if the CPU has around 8-12 virtual cores, then it eventually reaches the point where BasicAlloc actually scales negatively, while ThreadAlloc just "stalls" on the baseline overhead of soft faults.

If you profile the two different allocation strategies, you can see that for a low thread count, KiPageFault shifts from memcpy on BasicAlloc to VirtualLock on ThreadAlloc.

For higher thread and core counts, eventually ExpWaitForSpinLockExclusiveAndAcquire starts emerging from virtually zero load to up to 50% with BasicAlloc, while ThreadAlloc only maintains the constant overhead from KiPageFault itself.

Well, the stall with ThreadAlloc is also pretty bad. No matter how many cores or nodes in a NUMA system you have, you are currently hard capped to around 5-8GB/s in new allocations, across all processes in the system, solely limited by single thread performance. All that the dedicated memory management thread achieves is not wasting CPU cycles on a contended critical section.

You would have expected that Microsoft had a lock free strategy for assigning pages on different cores, but apparently that's not even remotely the case.

The spin-lock was also already present in the Windows 7 and earlier implementations of KiPageFault. So what did change?

Simple answer: KiPageFault itself became much slower. No clue what exactly caused it to slow down, but the spin-lock simply never became an obvious limit, because 100% contention was never possible before.

If someone wishes to disassemble KiPageFault to find the most expensive part - be my guest.

answered Sep 24 '22 01:09

Ext3h
