
Why is memcpy slower than a reinterpret_cast when parsing binary data?

TLDR: I forgot to enable compiler optimizations. With optimizations enabled, the performance is (nearly) identical.


Original post

When reading an integer from binary data, I noticed that memcpy is slower than a casting solution.

Version 1: reinterpret_cast, smelly due to potential alignment problems, but also faster (?):

int get_int_v1(const char * data) { return *reinterpret_cast<const int*>(data); }

Version 2: memcpy, correct and a little slower:

int get_int_v2(const char * data) { int result; memcpy(&result, data, sizeof(result)); return result; }

I have a benchmark on Ideone.

For future reference, the code is:

#include <cstdlib>
#include <cstdio>
#include <cstring>
#include <ctime>
#include <iostream>
#include <vector>
#include <sys/time.h>

double get_current_time()
{
    timeval tv;
    gettimeofday(&tv, NULL);
    return double(tv.tv_sec) + 0.000001 * tv.tv_usec;
}

int get_int_v1(const char * data) { return *reinterpret_cast<const int*>(data); }
int get_int_v2(const char * data) { int result; memcpy(&result, data, sizeof(result)); return result; }

const unsigned iterations = 200 * 1000 * 1000;

double test_v1(const char * c, unsigned & prevent_optimization)
{
    double start = get_current_time();
    for (unsigned i = 0; i != iterations; ++i)
    {
        prevent_optimization += get_int_v1(c);
    }
    return get_current_time() - start;
}

double test_v2(const char * c, unsigned & prevent_optimization)
{
    double start = get_current_time();
    for (unsigned i = 0; i != iterations; ++i)
    {
        prevent_optimization += get_int_v2(c);
    }
    return get_current_time() - start;
}

int main()
{
    srand(time(0));

    // Initialize data
    std::vector<int> numbers(1000);
    for (std::vector<int>::size_type i = 0; i != numbers.size(); ++i)
    {
        numbers[i] = i;
    }

    // Repeat benchmark 4 times.
    for (unsigned i = 0; i != 4; ++i)
    {
        unsigned p = 0;
        std::vector<int>::size_type index = rand() % numbers.size();
        const char * c = reinterpret_cast<const char *>(&numbers[index]);    
        std::cout << "v1: " << test_v1(c, p) << std::endl;
        std::cout << "v2: " << test_v2(c, p) << std::endl << std::endl;
    }
}

And the results are:

v1: 0.176457
v2: 0.557588

v1: 0.17654
v2: 0.220581

v1: 0.176826
v2: 0.22012

v1: 0.176131
v2: 0.220633

My questions are:

  • Is my benchmark correct?
  • If yes, then why is v2 (with memcpy) slower? Since both versions return a copy of the data, I think there should be no difference in performance.
  • How can I implement a solution that is correct and fast? (See the sketch below.)
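
For reference, the kind of helper I have in mind is just a generalized get_int_v2 (load_as is a hypothetical name; it assumes T is trivially copyable):

#include <cstring>

// Generalized get_int_v2: memcpy sidesteps the alignment and
// strict-aliasing problems of the cast, and optimizing compilers
// typically collapse the copy into a single load.
template <typename T>
T load_as(const char * data)
{
    T result;
    memcpy(&result, data, sizeof(result));
    return result;
}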


Update

I was being silly and forgot to consider that Ideone doesn't perform compiler optimizations. I also tweaked the code a little and came up with the following:

#include <algorithm>
#include <cstdlib>
#include <cstdio>
#include <cstring>
#include <ctime>
#include <iomanip> 
#include <iostream> 
#include <vector>
#include <sys/time.h>

double get_current_time()
{
    timeval tv;
    gettimeofday(&tv, NULL);
    return double(tv.tv_sec) + 0.000001 * tv.tv_usec;
}

struct test_cast
{
    int operator()(const char * data) const 
    {
        return *((int*)data);
    }
};

struct test_memcpy
{
    int operator()(const char * data) const 
    {
        int result;
        memcpy(&result, data, sizeof(result));
        return result;
    }
};

struct test_std_copy
{
    int operator()(const char * data) const 
    {
        int result;
        std::copy(data, data + sizeof(int), reinterpret_cast<char *>(&result));
        return result;
    }
};

enum
{
    iterations = 2000,
    container_size = 2000
};

std::vector<int> get_random_numbers()
{
    std::vector<int> numbers(container_size);
    for (std::vector<int>::size_type i = 0; i != numbers.size(); ++i)
    {
        numbers[i] = rand();
    }
    return numbers;
}

std::vector<int> get_random_indices()
{
    std::vector<int> numbers(container_size);
    for (std::vector<int>::size_type i = 0; i != numbers.size(); ++i)
    {
        numbers[i] = i;
    }
    std::random_shuffle(numbers.begin(), numbers.end());
    return numbers;
}

template<typename Function>
unsigned benchmark(const Function & f, unsigned & counter)
{
    std::vector<int> container = get_random_numbers();
    std::vector<int> indices = get_random_indices();
    double start = get_current_time();
    for (unsigned iter = 0; iter != iterations; ++iter)
    {
        for (unsigned i = 0; i != container.size(); ++i)
        {
            counter += f(reinterpret_cast<const char*>(&container[indices[i]]));
        }
    }
    return unsigned(0.5 + 1000.0 * (get_current_time() - start));
}

int main()
{
    srand(time(0));
    unsigned counter = 0;

    std::cout << "cast:      " << benchmark(test_cast(),     counter) << " ms" << std::endl;
    std::cout << "memcpy:    " << benchmark(test_memcpy(),   counter) << " ms" << std::endl;
    std::cout << "std::copy: " << benchmark(test_std_copy(), counter) << " ms" << std::endl;
    std::cout << "(counter:  " << counter << ")" << std::endl << std::endl;

}

The results are now nearly equal (except for std::copy, which is slower for some reason):

g++ -o test -O0 -Wall -Werror -Wextra -pedantic-errors main.cpp
cast:      56 ms
memcpy:    60 ms
std::copy: 290 ms
(counter:  2854155632)

g++ -o test -O1 -Wall -Werror -Wextra -pedantic-errors main.cpp
cast:      9 ms
memcpy:    14 ms
std::copy: 20 ms
(counter:  3524665968)

g++ -o test -O2 -Wall -Werror -Wextra -pedantic-errors main.cpp
cast:      4 ms
memcpy:    5 ms
std::copy: 20 ms
(counter:  2590914608)

g++ -o test -O3 -Wall -Werror -Wextra -pedantic-errors main.cpp
cast:      4 ms
memcpy:    5 ms
std::copy: 18 ms
(counter:  2590914608)
Asked by StackedCrooked, Oct 29 '12


3 Answers

You need to look at the emitted code. Obviously the optimizer "should" be able to turn the memcpy into a single potentially-unaligned int-sized read into the return value, but if you see different times then I reckon on x86 that means it hasn't.
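
For example, you can dump the assembly (exact output varies by compiler and target):

g++ -O2 -S -o - main.cpp

With a recent gcc on x86-64, the memcpy version should reduce to a single load, something like:

_Z10get_int_v2PKc:
        movl    (%rdi), %eax
        ret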

On my machine, using gcc with -O2 I get 0.09 for all times. With -O3 I get 0 for all times (I haven't checked whether that's because it's faster than the timer granularity, or because the optimizer has removed all your code).

So fairly likely, the answer is just that you haven't used the right compiler flags (or ideone hasn't).

On an architecture where a potentially-unaligned read requires different instructions from an aligned read, the reinterpret_cast could emit an aligned read while the memcpy might have to emit an unaligned read (depending on how the function is called; in this case the data is in fact aligned, but I don't know under what conditions the compiler can prove that). In that case I would expect the reinterpret_cast code to be faster than the memcpy, but of course it would be incorrect if someone passed in an unaligned pointer.
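
To illustrate with a sketch (read_at_offset_1 is a made-up example, not from the question): an access at an odd offset is exactly where the two versions diverge.

#include <cstring>

int read_at_offset_1(const char * buffer)
{
    // buffer + 1 is almost certainly misaligned for int, so
    // *reinterpret_cast<const int*>(buffer + 1) would be undefined
    // behaviour and can trap on alignment-enforcing hardware.
    int value;
    memcpy(&value, buffer + 1, sizeof(value)); // always well-defined
    return value;
}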

Answered by Steve Jessop


Casting is a compile-time operation, while memcpy() is a run-time operation. That's why the cast has no impact on the running time.

Answered by SomeWittyUsername


memcpy cannot copy to a register; it does a memory-to-memory copy. The reinterpret_cast in get_int_v1 can change the type of a pointer held in a register, which doesn't even require a register-to-register copy.

Answered by MSalters