Parallel fill std::vector with zero

I want to fill a std::vector<int> with zeros using OpenMP. What is a fast way to do that?

I have heard that looping over the vector to set each element to zero is slow, and that std::fill is much faster. Is that still true?

Fastest way to reset every value of std::vector<int> to 0

Do I have to manually divide the std::vector<int> into regions, use #pragma omp for to loop over the regions so that each thread gets one, and then call std::fill inside the loop?

hamster on wheels asked Feb 04 '17 20:02




1 Answer

You can split the vector into one chunk per thread and fill each chunk with std::fill:

// Requires <algorithm> and <omp.h>; v is a std::vector<int>.
#pragma omp parallel
{
    auto tid = omp_get_thread_num();
    auto nthreads = omp_get_num_threads();
    auto chunksize = v.size() / nthreads;
    auto begin = v.begin() + chunksize * tid;
    // The last thread also takes the remainder when the size does not divide evenly.
    auto end = (tid == nthreads - 1) ? v.end() : begin + chunksize;
    std::fill(begin, end, 0);
}

You can improve this further by rounding chunksize to a multiple of the cache line / memory word size (128 bytes = 32 ints, with a 4-byte int), assuming that v.data() is aligned to the same boundary. That way, you avoid any false sharing issues.
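The rounding could be sketched roughly as follows. The helper name chunk_bounds and its parameters are hypothetical (they do not come from the answer), the block size of 32 elements assumes a 4-byte int, and the sketch rounds the chunk size up rather than to the nearest multiple so that the chunks always cover the whole vector. It only helps against false sharing if the vector's storage itself starts on a 128-byte boundary, which std::vector does not guarantee by default.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper: returns the half-open index range [first, last) that
// thread `tid` out of `nthreads` should fill in a vector of `n` elements.
// Every chunk boundary except the end of the vector is a multiple of `block`
// elements (32 ints = 128 bytes, assuming a 4-byte int).
std::pair<std::size_t, std::size_t>
chunk_bounds(std::size_t n, std::size_t tid, std::size_t nthreads,
             std::size_t block = 32) {
    assert(nthreads > 0 && tid < nthreads && block > 0);
    // Even share per thread, rounded up so that nothing is left over.
    const std::size_t per_thread = (n + nthreads - 1) / nthreads;
    // Round the share up to a whole number of blocks.
    const std::size_t chunk = (per_thread + block - 1) / block * block;
    // Clamp to the vector size; trailing threads may get an empty range.
    const std::size_t first = std::min(n, tid * chunk);
    const std::size_t last = std::min(n, first + chunk);
    return {first, last};
}
```

Inside the parallel region, each thread would call chunk_bounds(v.size(), omp_get_thread_num(), omp_get_num_threads()) and pass v.begin() + first and v.begin() + last to std::fill. For small vectors, some threads simply receive an empty range.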

On a dual-socket, 24-core Haswell system, I get a speedup of roughly 9x: 3.6 s with 1 thread versus 0.4 s with 24 threads. For 4.8 billion ints, that works out to about 48 GB/s. The results vary a bit and this is not a scientific analysis, but it is not too far off the memory bandwidth of the system.
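For reference, the arithmetic behind those figures can be written out as a small sketch. The function names are hypothetical, a 4-byte int is assumed, and 1 GB is taken as 1e9 bytes.

```cpp
#include <cassert>
#include <cstddef>

// Speedup of a parallel run relative to the single-threaded run.
double speedup(double serial_seconds, double parallel_seconds) {
    assert(parallel_seconds > 0.0);
    return serial_seconds / parallel_seconds;
}

// Bytes written per second, in GB/s (1 GB = 1e9 bytes).
double write_bandwidth_gb_per_s(double elements,
                                std::size_t bytes_per_element,
                                double seconds) {
    assert(seconds > 0.0);
    return elements * static_cast<double>(bytes_per_element) / seconds / 1e9;
}
```

With the numbers above, speedup(3.6, 0.4) gives about 9, and write_bandwidth_gb_per_s(4.8e9, 4, 0.4) gives about 48: 4.8 billion ints at 4 bytes each are 19.2 GB, written in 0.4 s.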

For general performance, you should divide your vector the same way not only for this operation but also, where possible, for subsequent operations on it (reads or writes). That increases the chance that the data is already in cache when you need it, or at least on the same NUMA node.
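One way to keep the partitioning consistent without computing chunks by hand is to let OpenMP divide the loops, as in the minimal sketch below. This swaps the manual std::fill chunks for plain loops with schedule(static). The function zero_then_add and its delta parameter are hypothetical, chosen only to show a second pass over the same data. Both loops have the same iteration count and sit in the same parallel region, so each thread should handle the same index range in both passes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: zero the vector, then add `delta` to every element,
// with both passes partitioned identically across threads.
void zero_then_add(std::vector<int>& v, int delta) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(v.size());
    int* data = v.data();
    #pragma omp parallel
    {
        // First pass: each thread zeroes one contiguous block of indices.
        #pragma omp for schedule(static)
        for (std::ptrdiff_t i = 0; i < n; ++i) {
            data[i] = 0;
        }
        // Second pass: same iteration count and schedule, so each thread
        // revisits the block it just wrote.
        #pragma omp for schedule(static)
        for (std::ptrdiff_t i = 0; i < n; ++i) {
            data[i] += delta;
        }
    }
}
```

Compiled without OpenMP support, the pragmas are ignored and the function runs serially with the same result.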

Oddly enough, on my system std::fill(..., 1) is faster than std::fill(..., 0) with a single thread, but slower with 24 threads. This happens with both gcc 6.1.0 and icc 17.0.1. I guess I'll post that as a separate question.

Zulan answered Oct 27 '22 22:10