Expected number of bank conflicts in shared memory at random access

Tags:

cuda

Let A be a properly aligned array of 32-bit integers in shared memory.

If a single warp tries to fetch elements of A at random, what is the expected number of bank conflicts?

In other words:

__shared__ int A[N];          //N is some big constant integer
...
int v = A[ random(0..N-1) ];  // <-- expected number of bank conflicts here?

Please assume Tesla or Fermi architecture. I don't want to dwell into 32-bit vs 64-bit bank configurations of Kepler. Also, for simplicity, let us assume that all the random numbers are different (thus no broadcast mechanism).

My gut feeling suggests a number somewhere between 4 and 6, but I would like to find some mathematical evaluation of it.

I believe the problem can be abstracted out from CUDA and presented as a math problem. I searched it as an extension to Birthday Paradox, but I found really scary formulas there and didn't find a final formula. I hope there is a simpler way...

345

asked Oct 10 '12 15:10

CygnusX1

2 Answers

In math, this is thought of as a "balls in bins" problem - 32 balls are randomly dropped into 32 bins. You can enumerate the possible patterns and calculate their probabilities to determine the distribution. A naive approach will not work though as the number of patterns is huge: (63!)/(32!)(31!) is "almost" a quintillion.

It is possible to tackle though if you build up the solution recursively and use conditional probabilities.

Look for a paper called "The exact distribution of the maximum, minimum and the range of Multinomial/Dirichlet and Multivariate Hypergeometric frequencies" by Charles J. Corrado.

In the following, we start at leftmost bucket and calculate the probabilities for each number of balls that could have fallen into it. Then we move one to the right and determine the conditional probabilities of each number of balls that could be in that bucket given the number of balls and buckets already used.

Apologies for the VBA code, but VBA was all I had available when motivated to answer :).

Function nCr#(ByVal n#, ByVal r#)
    Static combin#()
    Static size#
    Dim i#, j#

    If n = r Then
        nCr = 1
        Exit Function
    End If

    If n > size Then
        ReDim combin(0 To n, 0 To n)
        combin(0, 0) = 1
        For i = 1 To n
            combin(i, 0) = 1
            For j = 1 To i
                combin(i, j) = combin(i - 1, j - 1) + combin(i - 1, j)
            Next
        Next
        size = n
    End If

    nCr = combin(n, r)
End Function

Function p_binom#(n#, r#, p#)
    p_binom = nCr(n, r) * p ^ r * (1 - p) ^ (n - r)
End Function

Function p_next_bucket_balls#(balls#, balls_used#, total_balls#, _
  bucket#, total_buckets#, bucket_capacity#)

    If balls > bucket_capacity Then
        p_next_bucket_balls = 0
    Else
        p_next_bucket_balls = p_binom(total_balls - balls_used, balls, 1 / (total_buckets - bucket + 1))
    End If
End Function

Function p_capped_buckets#(n#, cap#)
    Dim p_prior, p_update
    Dim bucket#, balls#, prior_balls#

    ReDim p_prior(0 To n)
    ReDim p_update(0 To n)

    p_prior(0) = 1

    For bucket = 1 To n
        For balls = 0 To n
            p_update(balls) = 0
            For prior_balls = 0 To balls
                p_update(balls) = p_update(balls) + p_prior(prior_balls) * _
                   p_next_bucket_balls(balls - prior_balls, prior_balls, n, bucket, n, cap)
            Next
        Next
        p_prior = p_update
    Next

    p_capped_buckets = p_update(n)
End Function

Function expected_max_buckets#(n#)
    Dim cap#

    For cap = 0 To n
        expected_max_buckets = expected_max_buckets + (1 - p_capped_buckets(n, cap))
    Next
End Function

Sub test32()
    Dim p_cumm#(0 To 32)
    Dim cap#

    For cap# = 0 To 32
        p_cumm(cap) = p_capped_buckets(32, cap)
    Next

    For cap = 1 To 32
        Debug.Print "    ", cap, Format(p_cumm(cap) - p_cumm(cap - 1), "0.000000")
    Next
End Sub

For 32 balls and buckets, I get an expected maximum number of balls in the buckets of about 3.532941.

Output to compare to ahmad's:

           1            0.000000
           2            0.029273
           3            0.516311
           4            0.361736
           5            0.079307
           6            0.011800
           7            0.001417
           8            0.000143
           9            0.000012
           10           0.000001
           11           0.000000
           12           0.000000
           13           0.000000
           14           0.000000
           15           0.000000
           16           0.000000
           17           0.000000
           18           0.000000
           19           0.000000
           20           0.000000
           21           0.000000
           22           0.000000
           23           0.000000
           24           0.000000
           25           0.000000
           26           0.000000
           27           0.000000
           28           0.000000
           29           0.000000
           30           0.000000
           31           0.000000
           32           0.000000

159

answered Oct 14 '22 00:10

A. Webb

I'll try a math answer, although I don't have it quite right yet.

You basically want to know, given random 32-bit word indexing within a warp into an aligned __shared__ array, "what is the expected value of the maximum number of addresses within a warp that map to a single bank?"

If I consider the problem similar to hashing, then it relates to the expected maximum number of items that will hash to a single location, and this document shows an upper bound on that number of O(log n / log log n) for hashing n items into n buckets. (The math is pretty hairy!).

For n = 32, that works out to about 2.788 (using natural log). That’s fine, but here I modified ahmad's program a bit to empirically calculate the expected maximum (also simplified the code and modified names and such for clarity and fixed some bugs).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <algorithm>
#define NBANK 32
#define WARPSIZE 32
#define NSAMPLE 100000

int main(){  
    int i=0,j=0;

    int *bank=(int*)malloc(sizeof(int)*NBANK);
    int *randomNumber=(int*)malloc(sizeof(int)*WARPSIZE);
    int *maxCount=(int*)malloc(sizeof(int)*(NBANK+1));
    memset(maxCount, 0, sizeof(int)*(NBANK+1));

    for (int i=0; i<NSAMPLE; ++i) {
        // generate a sample warp shared memory access
        for(j=0; j<WARPSIZE; j++){
            randomNumber[j]=rand()%NBANK;
        }

        // check the bank conflict
        memset(bank, 0, sizeof(int)*NBANK);
        int max_bank_conflict=0;
        for(j=0; j<WARPSIZE; j++){
            bank[randomNumber[j]]++;       
        }

        for(j=0; j<WARPSIZE; j++) 
            max_bank_conflict = std::max<int>(max_bank_conflict, bank[j]);

        // store statistic
        maxCount[max_bank_conflict]++;
    }

    // report statistic
    printf("Max conflict degree %% (%d random samples)\n", NSAMPLE);
    float expected = 0;
    for(i=1; i<NBANK+1; i++) {
        float prob = maxCount[i]/(float)NSAMPLE;
        printf("%02d -> %6.4f\n", i, prob);
        expected += prob * i;
    }
    printf("Expected maximum bank conflict degree = %6.4f\n", expected);
    return 0;
}

Using the percentages found in the program as probabilities, the expected maximum value is the sum of products sum(i * probability(i)), for i from 1 to 32. I compute the expected value to be 3.529 (matches ahmad's data). It’s not super far off, but the 2.788 is supposed to be an upper bound. Since the upper bound is given in big-O notation, I guess there’s a constant factor left out. But that's currently as far as I've gotten.

Open questions: Is that constant factor enough to explain it? Is it possible to compute the constant factor for n = 32? It would be interesting to reconcile these, and/or to find a closed form solution for the expected maximum bank conflict degree with 32 banks and 32 parallel threads.

This is a very useful topic, since it can help in modeling and predicting performance when shared memory addressing is effectively random.

answered Oct 13 '22 23:10

harrism

Related questions
                            
                                <<< >>> cuda in vscode
                            
                                CUDA Matrix multiplication breaks for large matrices
                            
                                CUDA programming - L1 and L2 caches
                            
                                GPU reads from CPU or CPU writes to the GPU?
                            
                                What is the size of my CUDA texture memory?
                            
                                detecting NVIDIA GPUs without CUDA
                            
                                CUDA: Is the texture cache on FERMI separate from L1 cache?
                            
                                CUDA How to access constant memory in device kernel when the constant memory is declared in the host code?
                            
                                Is Cuda 6 supported with Visual Studio 2013?
                            
                                CUDA missing libGL.so libGLU.so and libX11.so
                            
                                What are templated methods in C++ / CUDA with 3 angle brackets (<<<)?
                            
                                Mathematica and CUDA
                            
                                Proper way to write kernel functions in CUDA?
                            
                                Calling a kernel from a kernel
                            
                                Control flow divergence in SIMT and SIMD
                            
                                Gradient Descent Optimization in CUDA
                            
                                How can I implement a custom atomic function involving several variables?
                            
                                Emulating FP64 with 2 FP32 on a GPU
                            
                                Tensorflow: CUDA_VISIBLE_DEVICES doesn't seem to work
                            
                                CMake CUDA separate compilation static lib link error on Windows but not on Ubuntu

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With