
Why is running a function in one thread faster than calling it directly (MinGW)?

When I call the function directly, the execution time is 6.8 sec. When I run it in one thread, it takes 3.4 sec, and with two threads 1.8 sec. No matter which optimization level I use, the ratios stay the same.

In Visual Studio the times are as expected: 3.1, 3 and 1.7 sec.

#include<math.h>
#include<stdio.h>
#include<windows.h>
#include <time.h>

using namespace std;

#define N 400

float a[N][N];

struct b{
    int begin;
    int end;
};

DWORD WINAPI thread(LPVOID p)
{
    b b_t = *(b*)p;

    for(int i=0;i<N;i++)
        for(int j=b_t.begin;j<b_t.end;j++)
        {
            a[i][j] = 0;
            for(int k=0;k<i;k++)
                a[i][j]+=k*sin(j)-j*cos(k);
        }

    return (0);
}

int main()
{
    clock_t t;
    HANDLE hn[2];

    b b_t[3];

    b_t[0].begin = 0;
    b_t[0].end = N;

    b_t[1].begin = 0;
    b_t[1].end = N/2;

    b_t[2].begin = N/2;
    b_t[2].end = N;

    t = clock();
    thread(&b_t[0]);
    printf("0 - %d\n",clock()-t);

    t = clock();
    hn[0] = CreateThread ( NULL, 0, thread,  &b_t[0], 0, NULL);
    WaitForSingleObject(hn[0], INFINITE );
    printf("1 - %d\n",clock()-t);

    t = clock();
    hn[0] = CreateThread ( NULL, 0, thread,  &b_t[1], 0, NULL);
    hn[1] = CreateThread ( NULL, 0, thread,  &b_t[2], 0, NULL);
    WaitForMultipleObjects(2, hn, TRUE, INFINITE );
    printf("2 - %d\n",clock()-t);

    return 0;
}

Times:

0 - 6868
1 - 3362
2 - 1827

CPU - Core 2 Duo T9300

OS - Windows 8, 64 - bit

compiler: mingw32-g++.exe, gcc version 4.6.2

edit:

I tried a different order and got the same result; I even tried separate applications. Task Manager shows CPU utilization of around 50% for the direct call and for one thread, and 100% for two threads.

Sum of all elements after each call is the same: 3189909.237955
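The sum check mentioned above can be reproduced with a small helper. This is only a sketch: the name `sum_elements` is hypothetical and not part of the original program, and accumulating in `double` is an assumption made here to keep the rounding error of the sum itself small.

```cpp
#include <cstddef>

// Hypothetical helper, not part of the original program: adds up every
// element of a rows x cols matrix stored contiguously in row-major order.
// The running total is kept in double rather than float.
double sum_elements(const float *m, std::size_t rows, std::size_t cols)
{
    double total = 0.0;
    for (std::size_t i = 0; i < rows * cols; ++i)
        total += m[i];
    return total;
}
```

With the program's global array it could be called as `sum_elements(&a[0][0], N, N)` after each of the three runs; equal sums suggest that all three runs computed the same values.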

Cygwin result: 2.5, 2.5 and 2.5 sec

Linux result (pthread): 3.7, 3.7 and 2.1 sec

@borisbn results: 0 - 1446 1 - 1439 2 - 721.

asked Jan 15 '13 00:01 by user1978768


3 Answers

The difference comes from something in the math library's implementation of sin() and cos(): if you replace the calls to those functions with something else that takes time, the significant difference between step 0 and step 1 goes away.

Note that I see the difference with gcc (tdm-1) 4.6.1, which is a 32-bit toolchain targeting 32 bit binaries. Optimization makes no difference (not surprising since it seems to be something in the math library).

However, if I build using gcc (tdm64-1) 4.6.1, which is a 64-bit toolchain, the difference does not appear, regardless of whether the build creates a 32-bit program (using the -m32 option) or a 64-bit program (-m64).

Here are some example test runs (I made minor modifications to the source to make it C99 compatible):

  • Using the 32-bit TDM MinGW 4.6.1 compiler:

    C:\temp>gcc --version
    gcc (tdm-1) 4.6.1
    
    C:\temp>gcc -m32 -std=gnu99 -o test.exe test.c
    
    C:\temp>test
    0 - 4082
    1 - 2439
    2 - 1238
    
  • Using the 64-bit TDM 4.6.1 compiler:

    C:\temp>gcc --version
    gcc (tdm64-1) 4.6.1
    
    C:\temp>gcc -m32 -std=gnu99 -o test.exe test.c
    
    C:\temp>test
    0 - 2506
    1 - 2476
    2 - 1254
    
    C:\temp>gcc -m64 -std=gnu99 -o test.exe test.c
    
    C:\temp>test
    0 - 3031
    1 - 3031
    2 - 1539
    

A little more information:

The 32-bit TDM distribution (gcc (tdm-1) 4.6.1) links to the sin()/cos() implementations in the msvcrt.dll system DLL via a provided import library:

c:/mingw32/bin/../lib/gcc/mingw32/4.6.1/../../../libmsvcrt.a(dcfls00599.o)
                0x004a113c                _imp__cos

While the 64-bit distribution (gcc (tdm64-1) 4.6.1) doesn't appear to do that, instead linking to some static library implementation provided with the distribution:

c:/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/4.6.1/../../../../x86_64-w64-mingw32/lib/../lib32/libmingwex.a(lib32_libmingwex_a-cos.o)
                              C:\Users\mikeb\AppData\Local\Temp\cc3pk20i.o (cos)

Update/Conclusion:

After a bit of spelunking in a debugger, stepping through the assembly of msvcrt.dll's implementation of cos(), I've found that the difference in timing between the main thread and an explicitly created thread is due to the FPU's precision being set to a non-default setting (presumably the MinGW runtime in question does this at startup). In the situation where the thread() function takes twice as long, the FPU is set to 64-bit precision (REAL10, or in MSVC-speak _PC_64). When the FPU control word is something other than 0x27f (the default state?), the msvcrt.dll runtime performs the following steps in the sin() and cos() functions (and probably other floating-point functions):

  • save the current FPU control word
  • set the FPU control word to 0x27f (I believe it's possible for this value to be modified)
  • perform the fsin/fcos operation
  • restore the saved FPU control word

The save/restore of the FPU control word is skipped if it's already set to the expected/desired 0x27f value. Apparently saving/restoring the FPU control word is expensive, since it appears to double the amount of time the function takes.
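The conditional save/restore described above can be modeled in portable code. This is a sketch only, not msvcrt.dll's actual code: a plain variable (`model_cw`, a hypothetical name) stands in for the real x87 control register, `stand_in_op` stands in for fsin/fcos, and the starting value 0x37f is an assumption here for the 64-bit-precision state (it is the x87 default after initialization).

```cpp
// Model only: a plain variable stands in for the x87 control register,
// so this compiles anywhere and never touches real FPU state.
static unsigned model_cw = 0x37f;     // assumed 64-bit-precision state
static unsigned model_cw_writes = 0;  // counts simulated control-word loads

const unsigned EXPECTED_CW = 0x27f;   // value the answer reports msvcrt.dll expecting

// Hypothetical stand-in for the fsin/fcos instruction.
double stand_in_op(double x) { return x * 0.5; }

// Hypothetical wrapper mirroring the four steps: save, switch, compute, restore.
double call_with_expected_cw(double (*op)(double), double x)
{
    const unsigned saved = model_cw;                  // save the current control word
    const bool must_switch = (saved != EXPECTED_CW);  // fast path if already 0x27f
    if (must_switch) { model_cw = EXPECTED_CW; ++model_cw_writes; }
    const double result = op(x);                      // perform the operation
    if (must_switch) { model_cw = saved; ++model_cw_writes; }  // restore
    return result;
}
```

In this model a call made while `model_cw` is 0x37f performs two control-word writes, and a call made while it is already 0x27f performs none, which is the extra work the answer identifies as the source of the slowdown.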

You can solve the problem by adding the following line to main() before calling thread():

_control87( _PC_53, _MCW_PC);   // requires <float.h>
answered Nov 08 '22 21:11 by Michael Burr


Not a cache matter here.

Likely different runtime libraries are used for user-created threads and the main thread. You can compare the results of a[i][j]+=k*sin(j)-j*cos(k); in detail (the actual numbers) for specific values of i, j, and k to confirm the differences.

answered Nov 08 '22 20:11 by Arno


The reason is that the main thread is doing floating-point math with 64-bit precision, while the created threads are doing it with 53-bit precision.

You can confirm this, and fix it, by changing the code to:

...
extern "C" unsigned int _control87( unsigned int newv, unsigned int mask );

DWORD WINAPI thread(LPVOID p)
{
    printf( "_control87(): 0x%.4x\n", _control87( 0, 0 ) );
    _control87(0x00010000,0x00010000);
...

The output will be:

c:\temp>test   
_control87(): 0x8001f
0 - 2667
_control87(): 0x9001f
1 - 2683
_control87(): 0x9001f
_control87(): 0x9001f
2 - 1373

c:\temp>mingw32-c++ --version
mingw32-c++ (GCC) 4.6.2

You can see that run 0 would have executed without the 0x10000 flag; once the flag is set, it runs at the same speed as runs 1 and 2. If you look up the _control87() function, you'll see that this value is the _PC_53 flag, which sets the precision to 53 bits rather than the 64 bits it would have been had the field been left at zero.
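To read the precision setting out of the printed values, the precision-control field can be masked off. The sketch below redefines the relevant constants locally (they mirror `_MCW_PC`, `_PC_64`, `_PC_53` and `_PC_24` from the Microsoft `<float.h>`) so that it builds without Windows headers; `precision_bits` is a hypothetical helper name.

```cpp
// Local copies of the precision-control values used with _control87,
// redefined here so the sketch builds on any platform.
const unsigned PRECISION_MASK = 0x00030000u;  // mirrors _MCW_PC
const unsigned PRECISION_64   = 0x00000000u;  // mirrors _PC_64
const unsigned PRECISION_53   = 0x00010000u;  // mirrors _PC_53
const unsigned PRECISION_24   = 0x00020000u;  // mirrors _PC_24

// Hypothetical helper: returns the significand width in bits selected by a
// _control87-style word, or 0 if the field holds an unrecognised value.
int precision_bits(unsigned control_word)
{
    switch (control_word & PRECISION_MASK) {
        case PRECISION_64: return 64;
        case PRECISION_53: return 53;
        case PRECISION_24: return 24;
        default:           return 0;
    }
}
```

Applied to the output above, `precision_bits(0x8001f)` gives 64 (the main thread before the fix) and `precision_bits(0x9001f)` gives 53 (the created threads, and every call after the flag has been set).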

For some reason, MinGW isn't setting it at process init time to the same value that CreateThread() uses at thread creation time.

Another workaround is to turn on SSE2 with _set_SSE2_enable(1), which will run even faster but may give different results.

c:\temp>test   
0 - 1341
1 - 1326
2 - 702

I believe this is on by default for 64-bit builds because all 64-bit processors support SSE2.

answered Nov 08 '22 22:11 by Tony Lee