Why does copying a 2D array column by column take longer than row by row in C? [duplicate]

Tags: c

#include <stdio.h>
#include <time.h>

#define N  32768

char a[N][N];
char b[N][N];

int main() {
    int i, j;

    printf("address of a[%d][%d] = %p\n", N, N, (void *)&a[N][N]);
    printf("address of b[%5d][%5d] = %p\n", 0, 0, (void *)&b[0][0]);

    clock_t start = clock();
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            a[i][j] = b[i][j];
    clock_t end = clock();
    float seconds = (float)(end - start) / CLOCKS_PER_SEC;
    printf("time taken: %f secs\n", seconds);

    start = clock();
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] = b[i][j];
    end = clock();
    seconds = (float)(end - start) / CLOCKS_PER_SEC;
    printf("time taken: %f secs\n", seconds);

    return 0;
}

Output:

address of a[32768][32768] = 0x80609080
address of b[    0][    0] = 0x601080
time taken: 18.063229 secs
time taken: 3.079248 secs

Why does column-by-column copying take almost 6 times as long as row-by-row copying? I understand that a 2D array is basically a flat array of n×n elements in which A[i][j] is stored at index i*n + j, but using simple algebra, I calculated that a Turing machine head (on main memory) would have to travel the same distance [formula image not preserved] in both cases. Here n×n is the size of each array and x is the distance between the last element of the first array and the first element of the second array.


asked Dec 15 '15 23:12

Kakaji

2 Answers

It pretty much comes down to this image (source):

[image not preserved]

When accessing data, your CPU does not load just a single value; it also loads the adjacent data into the L1 cache. When you iterate through your array by row, the items that were automatically loaded into the cache are exactly the ones that are processed next. When you iterate by column, however, each access loads an entire "cache line" of data (the size varies per CPU), only a single item from it is used, and then the next line has to be loaded, which effectively makes the cache pointless.

The Wikipedia entry and, as a high-level overview, this PDF should help you understand how CPU caches work.

Edit: chqrlie in the comments is of course correct. One of the relevant factors here is that only very few of your columns fit into the L1 cache at the same time. If your rows were much smaller (say, the total size of your two-dimensional array was only a few kilobytes), then you might not see a performance impact from iterating per column.


answered Sep 19 '22 08:09

fresskoma

While it's normal to draw the array as a rectangle, the addressing of array elements in memory is linear: from 0 to the number of bytes available minus one (on nearly all machines).

Memory hierarchies (e.g. registers < L1 cache < L2 cache < RAM < swap space on disk) are optimized for the case where memory accesses are localized: accesses that are successive in time touch addresses that are close together. They are even more highly optimized (e.g. with prefetch strategies) for sequential access in linear order of addresses, e.g. 100, 101, 102, ...

In C, rectangular arrays are arranged in linear order by concatenating all the rows (other languages, like Fortran, concatenate columns instead). Therefore the most efficient way to read or write the array is to do all the columns of the first row, then move on to the rest, row by row.

If you go down the columns instead, successive touches are N bytes apart, where N is the number of bytes in a row: 100, 10100, 20100, 30100, ... for the case N = 10000 bytes. Then the second column is 101, 10101, 20101, etc. This is the absolute worst case for most cache schemes.

In the very worst case, you can cause a page fault on each access. These days, even on an average machine, it would take an enormous array to cause that. But if it happened, each touch could cost ~10 ms for a disk head seek. Sequential access costs a few nanoseconds per element. That's a difference of more than a factor of a million. Computation effectively stops in this case. It has a name: disk thrashing.
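As a quick check of that ratio, take the ~10 ms seek time from above and assume, purely for illustration, that "a few nanoseconds" means 5 ns per sequential access:

```latex
\[
\frac{t_{\text{fault}}}{t_{\text{seq}}}
  \approx \frac{10\,\text{ms}}{5\,\text{ns}}
  = \frac{10^{7}\,\text{ns}}{5\,\text{ns}}
  = 2 \times 10^{6}
\]
```

Any sequential cost below 10 ns per access gives a ratio above one million, which is where the "over a factor of a million" figure comes from.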

In a more normal case, where only cache misses are involved and not page faults, you might see a factor of a hundred. Still worth paying attention to.

answered Sep 17 '22 08:09

Gene
