C# TPL faster than C++ PPL?

Question

I coded a very simple app that used the Fibonacci fonction to compare the TPL's Parallel.ForEach vs PPL's parallel_for_each, and the result was really strange, on pc with 8 cores, the c# is 11 seconds faster then c++.

The same result on both vs2010 and vs 2011 preview.

C# code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using System.Diagnostics;

namespace ConsoleApplication1
{
    class Program
    {

            static void Main(string[] args)
            {
                var ll = new ConcurrentQueue<Tuple<int, int>>();
                var a = new int[12] { 40, 41, 42, 43, 44, 45, 46, 47, 35, 25, 36, 37 };

                long elapsed = time_call(() =>
                {
                    Parallel.ForEach(a, (n) => { ll.Enqueue(new Tuple<int, int>(n, fibonacci(n))); });
                });

                Console.WriteLine("TPL C# elapsed time: " + elapsed + "

");
                foreach (var ss in ll)
                {
                    Console.WriteLine(String.Format("fib<{0}>: {1}", ss.Item1, +ss.Item2));
                }

                 Console.ReadLine();
            }

            static long time_call(Action f)
            {
                var p = Stopwatch.StartNew();
                p.Start();
                f();
                p.Stop();
                return p.ElapsedMilliseconds;
            }

             Computes the nth Fibonacci number.
            static int fibonacci(int n)
            {
                if (n < 2) return n;
                return fibonacci(n - 1) + fibonacci(n - 2);
            }
        }
    }

c++ code:

#include <windows.h>
#include <ppl.h>
#include <concurrent_vector.h>
#include <array>
#include <tuple>
#include <algorithm>
#include <iostream>

using namespace Concurrency;
using namespace std;

template <class Function>
__int64 time_call(Function&& f) {
    __int64 begin = GetTickCount();
    f();
    return GetTickCount() - begin;
}

// Computes the nth Fibonacci number.
int fibonacci(int n) {
    if (n < 2) return n;
    return fibonacci(n-1) + fibonacci(n-2);
}

int wmain() {
    __int64 elapsed;
    array<int, 12> a ={ 40, 41, 42, 43, 44, 45, 46, 47, 35, 25, 36, 37 };
    concurrent_vector<tuple<int,int>> results2;

    elapsed = time_call([&]{
        parallel_for_each(a.begin(), a.end(), [&](int n) {
            results2.push_back(make_tuple(n, fibonacci(n)));
        });
    });  

    wcout << L"PPL  time: " << elapsed << L" ms" << endl << endl;
    for_each (results2.begin(), results2.end(), [](tuple<int,int>& pair) {
        wcout << L"fib(" << get<0>(pair) << L"): " << get<1>(pair) << endl;
    });

    cin.ignore();
}

Can you please point me, where part of my c++ code I m wrong?

Width group_task i have the same time like c# code:

task_group tasks;
    elapsed = time_call([&] 
    {
        for_each(begin(a), end(a), [&](int n) 
        {
            tasks.run([&,n]{results2.push_back(make_tuple(n, fibonacci(n)));});
        });
        tasks.wait();

user1127781 · Accepted Answer

Here are the explanations by Rahul v Patil Microsoft team

Hello,

Thanks for bringing this up. Indeed, you've identified the overhead associated with the default parallel for * - especially when the number of iterations are small, and the work size is variable. The default parallel for starts off by breaking down the work into 8 chunks (on 8 cores). As the work finishes, work is dynamically load-balanced. The default works great in most cases (large number of iterations), and when the underlying work per iteration is not well understood (let's say you call into a library) - but it does come with unacceptable overheads in some cases.

The solution is exactly what you've identified in your alternate implemtnation. To that effect, we'll have a parallel for partitioner called "simple" in next version of Visual Studio, which will be similar to the alternate implementation you describe and will have much better performance.

PS: The C# and C++ parallel for each implementations use slightly different algorithms in how they go through the iterations - hence you will see slightly different performance characteristics depending on the workload.

Regards

Lucidio Kuhn · Answer

There are some issues with your code, let’s address them one by one:

Using recursion to calculate Fibonacci causes the process to use an inordinate amount of memory, since it is using the call stack to calculate the result. Having recursive Fibonacci means you are not comparing the C# / C++ parallel frameworks, you are comparing the call stack mechanisms. You can calculate Fibonacci much faster:

int fibonacci(int n) 
{
    int curr = 1, prev = 0, total = 0;
    for (int i = 0; i < n; i++)
    {
        int pc = curr;
        curr += prev;
        total += curr;
        prev = pc;
    }
    return total;
}

With that function, I had to run at least 1 million times to get measureable values.

Use GetTickCount64 instead of GetTickCount:

template <class Function>
__int64 time_call(Function&& f) 
{
    __int64 begin = ::GetTickCount64();
    f();
    return GetTickCount64() - begin;
}

Running parallel_for with such small body can be actually detrimental to performance. It is better to use a more granular functor.

With these traits in mind, here is the code in C++:

// ParallelFibo.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <windows.h>
#include <ppl.h>
#include <concurrent_vector.h>
#include <array>
#include <tuple>
#include <algorithm>
#include <iostream>
#include <random>

using namespace Concurrency;
using namespace std;

template <class Function>
__int64 time_call(Function&& f) 
{
    __int64 begin = ::GetTickCount64();
    f();
    return GetTickCount64() - begin;
}

// Computes the nth Fibonacci number.
inline int fibonacci(int n) 
{
    int curr = 1, prev = 0, total = 0;
    for (int i = 0; i < n; i++)
    {
        int pc = curr;
        curr += prev;
        total += curr;
        prev = pc;
    }
    return total;
}

#define NUMBER_OF_REPETITIONS 1000000
#define MIN_FIBO              37
#define MAX_FIBO              49

int main()
{
    __int64 elapsed;
    vector<int> v;
    for (int i = MIN_FIBO; i < MAX_FIBO; i++)
    {
        v.emplace_back(i);
    }

    concurrent_vector<tuple<int, int>> results;
    elapsed = time_call([&] {
        parallel_for(MIN_FIBO, MAX_FIBO, [&](int n) {
            for (int i = 0; i < NUMBER_OF_REPETITIONS; i++)
            {
                results.push_back(make_tuple(n, fibonacci(n)));
            }
        });
    });
    wcout << L"PPL  time: " << elapsed << L" ms" << endl << endl;
    cin.ignore();
    return 0;
}

And here is the code in C#:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Collections.Concurrent;
using System.Diagnostics;

namespace ParallelFiboCS
{
    class Program
    {
        const int NUMBER_OF_REPETITIONS = 1000000;
        const int MIN_FIBO = 37;
        const int MAX_FIBO = 49;
        static void Main(string[] args)
        {
            var ll = new ConcurrentQueue<Tuple<int, int>>();

            var a = new int[MAX_FIBO - MIN_FIBO];
            for (int i = MIN_FIBO; i < MAX_FIBO; i++)
            {
                a[i - MIN_FIBO] = i;
            }

            long elapsed = time_call(() =>
            {
                Parallel.ForEach(a, (n) =>
                {
                    for (int i = 0; i < NUMBER_OF_REPETITIONS; i++)
                    {
                        ll.Enqueue(new Tuple<int, int>(n, fibonacci(n)));
                    }
                });
            });

            Console.WriteLine("TPL C# elapsed time: " + elapsed + "

");

            Console.ReadLine();
        }

        static long time_call(Action f)
        {
            var p = Stopwatch.StartNew();
            p.Start();
            f();
            p.Stop();
            return p.ElapsedMilliseconds;
        }

        static int fibonacci(int n)
        {
            int curr = 1, prev = 0, total = 0;
            for (int i = 0; i < n; i++)
            {
                int pc = curr;
                curr += prev;
                total += curr;
                prev = pc;
            }
            return total;
        }
    }
}

Average time to run 12 million Fibonacci calculations for numbers between 37 and 49:

C++: 513ms

C#: 2527ms

C# TPL faster than C++ PPL?

Tags:

c++

c#-4.0

user1127781

2 Answers

user1127781

Lucidio Kuhn

Recent Activity

Donate For Us

C# TPL faster than C++ PPL?

Tags:

c++

c#-4.0

user1127781

2 Answers

user1127781

Lucidio Kuhn

Related questions

Recent Activity

Donate For Us