Let A
be a properly aligned array of 32-bit integers in shared memory.
If a single warp tries to fetch elements of A
at random, what is the expected number of bank conflicts?
In other words:
__shared__ int A[N]; //N is some big constant integer
...
int v = A[ random(0..N-1) ]; // <-- expected number of bank conflicts here?
Please assume Tesla or Fermi architecture. I don't want to dwell into 32-bit vs 64-bit bank configurations of Kepler. Also, for simplicity, let us assume that all the random numbers are different (thus no broadcast mechanism).
My gut feeling suggests a number somewhere between 4 and 6, but I would like to find some mathematical evaluation of it.
I believe the problem can be abstracted out from CUDA and presented as a math problem. I searched it as an extension to Birthday Paradox, but I found really scary formulas there and didn't find a final formula. I hope there is a simpler way...
Summary. Shared memory is a powerful feature for writing well optimized CUDA code. Access to shared memory is much faster than global memory access because it is located on chip. Because shared memory is shared by threads in a thread block, it provides a mechanism for threads to cooperate.
The shared memory that can be accessed in parallel is divided into modules (also called banks). If two memory locations (addresses) occur in the same bank, then you get a bank conflict during which the access is done serially, losing the advantages of parallel access.
In math, this is thought of as a "balls in bins" problem - 32 balls are randomly dropped into 32 bins. You can enumerate the possible patterns and calculate their probabilities to determine the distribution. A naive approach will not work though as the number of patterns is huge: (63!)/(32!)(31!) is "almost" a quintillion.
It is possible to tackle though if you build up the solution recursively and use conditional probabilities.
Look for a paper called "The exact distribution of the maximum, minimum and the range of Multinomial/Dirichlet and Multivariate Hypergeometric frequencies" by Charles J. Corrado.
In the following, we start at leftmost bucket and calculate the probabilities for each number of balls that could have fallen into it. Then we move one to the right and determine the conditional probabilities of each number of balls that could be in that bucket given the number of balls and buckets already used.
Apologies for the VBA code, but VBA was all I had available when motivated to answer :).
Function nCr#(ByVal n#, ByVal r#)
Static combin#()
Static size#
Dim i#, j#
If n = r Then
nCr = 1
Exit Function
End If
If n > size Then
ReDim combin(0 To n, 0 To n)
combin(0, 0) = 1
For i = 1 To n
combin(i, 0) = 1
For j = 1 To i
combin(i, j) = combin(i - 1, j - 1) + combin(i - 1, j)
Next
Next
size = n
End If
nCr = combin(n, r)
End Function
Function p_binom#(n#, r#, p#)
p_binom = nCr(n, r) * p ^ r * (1 - p) ^ (n - r)
End Function
Function p_next_bucket_balls#(balls#, balls_used#, total_balls#, _
bucket#, total_buckets#, bucket_capacity#)
If balls > bucket_capacity Then
p_next_bucket_balls = 0
Else
p_next_bucket_balls = p_binom(total_balls - balls_used, balls, 1 / (total_buckets - bucket + 1))
End If
End Function
Function p_capped_buckets#(n#, cap#)
Dim p_prior, p_update
Dim bucket#, balls#, prior_balls#
ReDim p_prior(0 To n)
ReDim p_update(0 To n)
p_prior(0) = 1
For bucket = 1 To n
For balls = 0 To n
p_update(balls) = 0
For prior_balls = 0 To balls
p_update(balls) = p_update(balls) + p_prior(prior_balls) * _
p_next_bucket_balls(balls - prior_balls, prior_balls, n, bucket, n, cap)
Next
Next
p_prior = p_update
Next
p_capped_buckets = p_update(n)
End Function
Function expected_max_buckets#(n#)
Dim cap#
For cap = 0 To n
expected_max_buckets = expected_max_buckets + (1 - p_capped_buckets(n, cap))
Next
End Function
Sub test32()
Dim p_cumm#(0 To 32)
Dim cap#
For cap# = 0 To 32
p_cumm(cap) = p_capped_buckets(32, cap)
Next
For cap = 1 To 32
Debug.Print " ", cap, Format(p_cumm(cap) - p_cumm(cap - 1), "0.000000")
Next
End Sub
For 32 balls and buckets, I get an expected maximum number of balls in the buckets of about 3.532941.
Output to compare to ahmad's:
1 0.000000
2 0.029273
3 0.516311
4 0.361736
5 0.079307
6 0.011800
7 0.001417
8 0.000143
9 0.000012
10 0.000001
11 0.000000
12 0.000000
13 0.000000
14 0.000000
15 0.000000
16 0.000000
17 0.000000
18 0.000000
19 0.000000
20 0.000000
21 0.000000
22 0.000000
23 0.000000
24 0.000000
25 0.000000
26 0.000000
27 0.000000
28 0.000000
29 0.000000
30 0.000000
31 0.000000
32 0.000000
I'll try a math answer, although I don't have it quite right yet.
You basically want to know, given random 32-bit word indexing within a warp into an aligned __shared__
array, "what is the expected value of the maximum number of addresses within a warp that map to a single bank?"
If I consider the problem similar to hashing, then it relates to the expected maximum number of items that will hash to a single location, and this document shows an upper bound on that number of O(log n / log log n) for hashing n items into n buckets. (The math is pretty hairy!).
For n = 32, that works out to about 2.788 (using natural log). That’s fine, but here I modified ahmad's program a bit to empirically calculate the expected maximum (also simplified the code and modified names and such for clarity and fixed some bugs).
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <algorithm>
#define NBANK 32
#define WARPSIZE 32
#define NSAMPLE 100000
int main(){
int i=0,j=0;
int *bank=(int*)malloc(sizeof(int)*NBANK);
int *randomNumber=(int*)malloc(sizeof(int)*WARPSIZE);
int *maxCount=(int*)malloc(sizeof(int)*(NBANK+1));
memset(maxCount, 0, sizeof(int)*(NBANK+1));
for (int i=0; i<NSAMPLE; ++i) {
// generate a sample warp shared memory access
for(j=0; j<WARPSIZE; j++){
randomNumber[j]=rand()%NBANK;
}
// check the bank conflict
memset(bank, 0, sizeof(int)*NBANK);
int max_bank_conflict=0;
for(j=0; j<WARPSIZE; j++){
bank[randomNumber[j]]++;
}
for(j=0; j<WARPSIZE; j++)
max_bank_conflict = std::max<int>(max_bank_conflict, bank[j]);
// store statistic
maxCount[max_bank_conflict]++;
}
// report statistic
printf("Max conflict degree %% (%d random samples)\n", NSAMPLE);
float expected = 0;
for(i=1; i<NBANK+1; i++) {
float prob = maxCount[i]/(float)NSAMPLE;
printf("%02d -> %6.4f\n", i, prob);
expected += prob * i;
}
printf("Expected maximum bank conflict degree = %6.4f\n", expected);
return 0;
}
Using the percentages found in the program as probabilities, the expected maximum value is the sum of products sum(i * probability(i))
, for i
from 1 to 32. I compute the expected value to be 3.529 (matches ahmad's data). It’s not super far off, but the 2.788 is supposed to be an upper bound. Since the upper bound is given in big-O notation, I guess there’s a constant factor left out. But that's currently as far as I've gotten.
Open questions: Is that constant factor enough to explain it? Is it possible to compute the constant factor for n = 32? It would be interesting to reconcile these, and/or to find a closed form solution for the expected maximum bank conflict degree with 32 banks and 32 parallel threads.
This is a very useful topic, since it can help in modeling and predicting performance when shared memory addressing is effectively random.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With