Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Understand whether code sample is CPU bound or Memory bound

As a general question to those working on optimization and performance tuning of programs, how do you figure out if your code is CPU bound or Memory bound? I understand these concepts in general, but if I have say, 'y' amounts of loads and stores and '2y' computations, how does one go about finding what is the bottleneck?

Also can you figure out where exactly you are spending most of your time and say, if you load 'x' amount of data into cache (if its memory bound), in every loop iteration, then your code will run faster? Is there any precise way to determine this 'x', other than trial and error?

Are there any tools that you'll use, say on the IA-32 or IA-64 architecture? Doest VTune help?

For example, I'm currently doing the following:

I have 26 8*8 matrices of complex doubles and I have to perform a MVM (matrix vector multiplication) with (~4000) vectors of length 8, for each of these 26 matrices. I use SSE to perform the complex multiplication.

/*Copy 26 matrices to temporary storage*/
for(int i=0;i<4000;i+=2){//Loop over the 4000 vectors 
    for(int k=0;k<26;k++){//Loop over the 26 matrices
       /*
        Perform MVM in blocks of '2' between kth matrix and 
        'i' and 'i+1' vector
       */     

    }
}

The 26 matrices take 26kb (L1 cache is 32KB) and I have laid the vectors out in memory such that I have stride'1' accesses. Once I perform MVM on a vector with the 27 matrices, I don't visit them again, so I don't think cache blocking will help. I have used vectorization but I'm still stuck on 60% of peak performance.

I tried copying, say 64 vectors, into temporary storage, for every iteration of the outer loop thinking they'll be in cache and help, but its only decreased performance. I tried using _mm_prefetch() in the following way: When I am done with about half the matrices, I load the next 'i' and 'i+1' vector into memory, but that too hasn't helped.

I have done all this assuming its memory bound but I want to know for sure. Is there a way?

like image 527
user1715122 Avatar asked Feb 20 '13 19:02

user1715122


People also ask

How do I know if my code is CPU-bound?

A program is CPU bound if it would go faster if the CPU were faster, i.e. it spends the majority of its time simply using the CPU (doing calculations).

How do you check if a process is CPU-bound or IO bound?

To determine if a system is CPU bound, run sar -u (or cpusar -u for each processor on a system with an SCO SMP License) and examine the %idle value. If %idle is consistently less than 5% (for all CPUs) on a heavily loaded database server system, then the system may be lacking in processing power.

What tasks are CPU-bound?

CPU Bound. We can say that a computer is CPU-bound (or compute-bound) when the time for it to complete a task is correlated to the speed of the central processor. Tasks that do a lot of calculations with comparatively little data are often CPU bound.

Can a measure of whether a process is likely to be CPU-bound or I O bound be determined by analyzing source code How can this be determined at run time?

An I/O bound process would require a number of read/write operations whereas a CPU bound process would be busy in computations. So, it can be determinate from the source code itself whether the process would be CPU bound or I/O bound.


1 Answers

To my understanding the best way is profiling your application/workload. Based on the input data, the characteristic of the application/workload can significantly vary. These behaviors can however be quantified with to few phases Ref[2, 3] and a histogram can broadly tell the most frequent path of the workload to be optimized. The question that you are asking will also require benchmark programs [like SPEC2006, PARSEC, Media bench etc] for an architecture and is difficult to answer in general terms ( and is an active part of research in computer architecture). However, for specific cases a quantitative result can be stated for different memory hierarchies. You can use tools like:

  • Perf
  • OProfile
  • VTune
  • LikWid
  • LLTng

and other monitoring and simulation tools to get the profiling traces of the application. You can look at performance counters like IPC, CPI ( for CPU bound) and memory access, cache misses, cache access , and other memory counters for determining memory boundedness.like IPC, Memory access per cycle (MPC), is often used to determine the memory boundedness of an application/workload. To specifically improve matrix multiplication, I would suggest using a optimized algorithm as in LAPACK.

like image 95
Shan Avatar answered Oct 06 '22 04:10

Shan