
How to structure a C++ application to use a multicore processor

Tags: c++, multicore

I am building an application that will do some object tracking from a video camera feed and use that information to drive a particle system in OpenGL. The code that processes the video feed is somewhat slow, 200-300 milliseconds per frame right now. The system this will run on has a dual-core processor. To maximize performance I want to offload the camera processing to one core and just communicate relevant data back to the main application as it becomes available, while the main application keeps running on the other core.

What do I need to do to offload the camera work to the other processor and how do I handle communication with the main application?

Edit: I am running Windows 7 64-bit.

asked Jan 30 '10 by Mr Bell




2 Answers

Basically, you need to multithread your application. Each thread of execution can only saturate one core, and separate threads tend to be run on separate cores. If you insist that a given thread always run on a specific core, each operating system has its own way of specifying that (affinity masks and such)... but I wouldn't recommend it.
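For completeness, since the question targets Windows 7: pinning the current thread to a specific core would look roughly like the sketch below, using the Win32 SetThreadAffinityMask call. This is only an illustration of what "affinity masks & such" means; as noted above, it's usually unnecessary.

#include <windows.h>

// Restrict the calling thread to core 0 (bit i of the mask = allowed to run on core i).
// Normally you should let the scheduler spread threads across cores on its own.
void pin_current_thread_to_core0()
{
    DWORD_PTR previous = SetThreadAffinityMask(GetCurrentThread(), 1);
    if (previous == 0) {
        // the call failed; the thread keeps its old affinity
    }
}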

OpenMP is great, but it's a bit heavyweight, especially when joining back up after a parallel region. YMMV. It's easy to use, but not at all the best-performing option. It also requires compiler support.

If you're on Mac OS X 10.6 (Snow Leopard), you can use Grand Central Dispatch. It's interesting to read about even if you don't use it, as its design implements some best practices. It isn't optimal either, but it's better than OpenMP, even though it also requires compiler support.

If you can wrap your head around breaking up your application into "tasks" or "jobs," you can shove these jobs down as many pipes as you have cores. Think of batching your processing as atomic units of work. If you can segment it properly, you can run your camera processing on both cores, and your main thread at the same time.

If communication is minimized for each unit of work, then your need for mutexes and other locking primitives will be minimized too. Coarse-grained threading is much easier than fine-grained. And you can always use a library or framework to ease the burden. Consider Boost's Thread library if you take the manual approach. It provides portable wrappers and a nice abstraction.
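If you take that manual approach, a rough sketch of the pattern looks like the code below. It uses the standard C++11 thread classes, whose interface Boost.Thread closely mirrors; TrackingResult, camera_worker, and drain_results are made-up names for illustration, not anything from the question.

#include <atomic>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical result produced by the camera-processing code.
struct TrackingResult { float x, y; };

std::queue<TrackingResult> results;   // work finished by the camera thread
std::mutex results_mutex;             // guards the queue
std::atomic<bool> done{false};

// Runs on its own core: process frames and push results as they become available.
void camera_worker()
{
    while (!done) {
        TrackingResult r{};           // ...analyze one camera frame here (the 200-300 ms part)...
        std::lock_guard<std::mutex> lock(results_mutex);
        results.push(r);
    }
}

// Called from the main/render loop: grab whatever is ready, never block on the camera.
void drain_results()
{
    std::lock_guard<std::mutex> lock(results_mutex);
    while (!results.empty()) {
        TrackingResult r = results.front();
        results.pop();
        // feed r into the particle system here
    }
}

// Startup: std::thread worker(camera_worker);  Shutdown: done = true; worker.join();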

answered Sep 20 '22 by pestilence669


It depends on how many cores you have. If you have only 2 cores (CPUs, processors, hyperthreads, you know what I mean), then OpenMP cannot give a tremendous increase in performance, but it will help. The most you can gain is to divide the parallelizable work by the number of processors, so it will still take 100-150 ms per frame.

The equation is:
parallel time = (([total time to perform the task] - [time in code that cannot be parallelized]) / [number of CPUs]) + [time in code that cannot be parallelized]
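To plug in the numbers from the question (assuming, optimistically, that essentially all of the 200-300 ms of frame processing can be parallelized, i.e. the serial part is close to zero):

parallel time = (300 ms - 0 ms) / 2 + 0 ms = 150 ms
parallel time = (200 ms - 0 ms) / 2 + 0 ms = 100 ms

which is where the 100-150 ms per frame figure comes from; any serial portion pushes that number back up.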

Basically, OpenMP rocks at parallelizing loops. It's rather easy to use:

// The iterations are split across the available cores; the loop index is private to each thread.
#pragma omp parallel for
for (int i = 0; i < N; i++)
    a[i] = 2 * i;

and bang, your for loop is parallelized. It does not work for every case, and not every algorithm can be parallelized this way, but many can be rewritten (hacked) to be compatible. The key principle is Single Instruction, Multiple Data (SIMD): applying the same convolution code to multiple pixels, for example.
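Applied to the camera case, the same idea might look like the sketch below, where frame, output, width, height, and process_pixel are placeholders I've invented for whatever the real per-pixel work is:

// Split the rows of one video frame across the available cores.
#pragma omp parallel for
for (int y = 0; y < height; y++)
    for (int x = 0; x < width; x++)
        output[y * width + x] = process_pixel(frame[y * width + x]);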

But simply applying this cookbook recipe goes against the rules of optimization:
1. Benchmark your code.
2. Find the REAL bottlenecks with "scientific" evidence (numbers) instead of simply guessing where you think the bottleneck is.
3. If it really is the processing loops, then OpenMP is for you.

Maybe simple optimizations of your existing code will give better results; who knows?

Another road would be to run OpenGL in one thread and the data processing in another. This will help a lot if OpenGL or your particle rendering system takes a lot of power, but remember that threading can lead to other kinds of synchronization bottlenecks.
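A minimal shape for that split, assuming the render thread only ever needs the most recent tracking result (the names here are invented for illustration, not anything from the question):

#include <atomic>
#include <mutex>
#include <thread>

struct TrackingData { float x, y; };

TrackingData latest{};              // most recent result from the processing thread
std::mutex latest_mutex;
std::atomic<bool> running{true};

void processing_thread()            // slow path: 200-300 ms per camera frame
{
    while (running) {
        TrackingData d{};           // ...analyze one frame here...
        std::lock_guard<std::mutex> lock(latest_mutex);
        latest = d;                 // publish the newest result
    }
}

void render_frame()                 // fast path: called from the OpenGL/render thread
{
    TrackingData d;
    {
        std::lock_guard<std::mutex> lock(latest_mutex);
        d = latest;                 // copy under the lock, render without holding it
    }
    // drive the particle system with d and draw with OpenGL
}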

answered Sep 20 '22 by Eric