 

Is it possible to reduce cache misses when visiting a large vector in sequence?

I am trying to reduce cache misses when visiting a large array in sequence. The array size is about 10 million elements. I am not sure whether it is possible to reduce the cache misses any further.

There are two arrays: one is the raw array loaded from a data file (ds_append_), and the other (ds_) is updated from the first. The sample data types are defined as follows:

/**
 * @brief Raw sample data structure which loads from the file.
 */
struct sample_data_app_t {
  point_double_t sample_data_{};  //!< sample data's raw data.
  int32_t cid_{};                 //!< sample data's Channel's ID.
  int32_t dbg_blk_id_{-1}, sid_{};
};

/**
 * @brief Sample data structure.
 * @details Sample data structure contains the sample point read from the
 * file, the mapped XY calculated from the R/W configuration, and the display
 * XY calculated from the chart configuration. @see ch_config_t
 */
struct sample_data_t {
  int32_t cid_{};                //!< sample data's Channel's ID (0 ~ 8).
  point_double_t mapped_pos_{};  //!< sample data's mapped position.
  sample_data_app_t* aptr_{};    //!< sample data's append pointer.
};

/**
 * @brief Sample dataset structure.
 * @details Sample dataset structure contains a vector of sample data.
 */
struct sample_dataset_t {
  std::vector<sample_data_t> ds_;                   //!< sample dataset's sample data vector.
  std::shared_ptr<sample_data_app_t[]> ds_append_;  //!< sample dataset's sample data append vector.
  size_t array_size_{0};                            //!< sample dataset's sample data array size.
};

The following code snippet is responsible for updating ds_ from ds_append_:

  auto update_func = [&](sample_data_t& sample) {
      const auto cid = sample.cid_; // channel index 0 ~ 8
      const auto& ch_config = chConfigs[cid];
      const auto& sample_range = sampleDataConfig.ch_limits_[cid];

      const auto ax = sample.aptr_->sample_data_.x_;
      const auto ay = sample.aptr_->sample_data_.y_;

      const auto slen = sample_range.y_max_ - sample_range.y_min_;
      const auto ys = (ay - sample_range.y_min_) * 256.0 / slen;

      const double yy1 = std::abs(ch_config.max_spc_) * ch_config.cos_x_ * ys / 256.0;
      const double mapped_y = ch_config.os_y_ + yy1 * ch_config.scale_y_;

      const double mapped_x = ch_config.os_x_ + ax * ch_config.unit_scale_x_;
      sample.mapped_pos_ = {mapped_x, mapped_y};
    };

    const auto data_num = ds.ds_.size();
    const auto thd_num = std::thread::hardware_concurrency();
    if (data_num > kParallelThreshold) {
      tbb::parallel_for(
        tbb::blocked_range<size_t>(0, data_num, data_num / thd_num),
        [&](const tbb::blocked_range<size_t>& r) {
          for (size_t i = r.begin(); i != r.end(); i++) {
            update_func(ds.ds_[i]);
          }
        },
        tbb::static_partitioner{});
    } else {
      std::for_each(ds.ds_.begin(), ds.ds_.end(), update_func);
    }
  • Note 1: cid_ is a random value from 0 to 8 for each sample.
  • Note 2: chConfigs[] and sampleDataConfig.ch_limits_[] each have 9 elements.

The profiler reports several memory-bound spots in the test. One of them is:

const auto cid = sample.cid_; // channel index 0 ~ 8
const auto& ch_config = chConfigs[cid];
const auto& sample_range = sampleDataConfig.ch_limits_[cid];

[profiler screenshot: memory-bound breakdown for the lines above]

I'm confused: why does the address-calculation instruction lea (%r8, %r8, 2), %r8 show such a high memory-bound share?

The test code repo: github -- perf_tests -- test_array_update

asked Jan 19 '26 08:01 by roderick

1 Answer

I just ran this code through a profiler: your code is hitting the maximum DRAM bandwidth, roughly 30 GB/s on this machine. DRAM sticks have a hard limit on how much data they can transfer per second.
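As a rough back-of-the-envelope estimate (my numbers, not from the profile, assuming point_double_t is two doubles so that both sample_data_t and sample_data_app_t occupy about 32 bytes on a typical 64-bit build): one update pass over 10 million samples reads roughly 320 MB of ds_append_, and reads plus writes back roughly 640 MB of ds_, i.e. on the order of 1 GB of DRAM traffic. At ~30 GB/s that is about 30 ms per pass no matter how cheap the per-element arithmetic is.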

The most obvious solution is to throw money at the problem and use a machine with more memory channels and faster DRAM sticks. (FYI, GPUs have much faster DRAM.)

From the code standpoint, you need to do more work per byte read or written. That is what being cache-friendly means here: chain multiple operations/transformations per iteration instead of running several sequential parallel_for loops over the same data, as in the sketch below.
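A minimal sketch of that idea (map_position and clamp_to_view are hypothetical placeholder steps, not from the question): the fused version touches each element once per pass, while the two-pass version streams the whole array through DRAM twice.

#include <cstddef>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <vector>

struct sample_t { double x_, y_; };

// Hypothetical step 1: map raw coordinates into display space.
inline void map_position(sample_t& s) { s.x_ = s.x_ * 0.5 + 1.0; }
// Hypothetical step 2: clamp the result to a viewport.
inline void clamp_to_view(sample_t& s) { if (s.y_ > 256.0) s.y_ = 256.0; }

// Memory-bound version: two passes, each limited by DRAM bandwidth.
void two_passes(std::vector<sample_t>& ds) {
  tbb::parallel_for(tbb::blocked_range<size_t>(0, ds.size()),
                    [&](const tbb::blocked_range<size_t>& r) {
                      for (size_t i = r.begin(); i != r.end(); ++i) map_position(ds[i]);
                    });
  tbb::parallel_for(tbb::blocked_range<size_t>(0, ds.size()),
                    [&](const tbb::blocked_range<size_t>& r) {
                      for (size_t i = r.begin(); i != r.end(); ++i) clamp_to_view(ds[i]);
                    });
}

// Fused version: the second step runs while the element is still hot in cache.
void one_pass(std::vector<sample_t>& ds) {
  tbb::parallel_for(tbb::blocked_range<size_t>(0, ds.size()),
                    [&](const tbb::blocked_range<size_t>& r) {
                      for (size_t i = r.begin(); i != r.end(); ++i) {
                        map_position(ds[i]);
                        clamp_to_view(ds[i]);
                      }
                    });
}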

Writing more cache-friendly code can also be done with C++20 ranges, which let you build data-processing pipelines instead of writing multiple raw loops. You can even process the data as it is being read from the file instead of saving it into an intermediate vector first; TBB even has an example of processing data in parallel as it is read from a file.
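For illustration only (a small sequential sketch with made-up types, not the asker's data): composing std::views::transform steps keeps the pipeline lazy, so each element flows through every step in a single pass and no intermediate vector is materialized.

#include <cstdio>
#include <ranges>
#include <vector>

struct raw_t { double x_, y_; };
struct pos_t { double x_, y_; };

int main() {
  std::vector<raw_t> raw = {{1, 10}, {2, 20}, {3, 30}};

  auto pipeline = raw
      | std::views::transform([](const raw_t& r) {  // step 1: map into display space
          return pos_t{r.x_ * 0.5, r.y_ / 256.0};
        })
      | std::views::transform([](pos_t p) {         // step 2: apply an offset
          p.x_ += 100.0;
          return p;
        });

  for (const pos_t& p : pipeline)  // one pass: each element runs through both steps
    std::printf("%f %f\n", p.x_, p.y_);
}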

It is also recommended to run that pipeline in small batches of roughly 8 to 1024 elements, so the compiler can use SIMD without the working set spilling out of the L1 cache (see the sketch below).
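A minimal sketch of that batching idea (the batch size and step functions are my own placeholders): process a small block with step 1, then immediately reuse it for step 2 while it is still resident in L1.

#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kBatch = 256;  // somewhere in the 8..1024 range, tuned to fit in L1

// Simple per-batch steps; tight loops like these are easy for the compiler to vectorize.
void step1(double* x, std::size_t n) { for (std::size_t i = 0; i < n; ++i) x[i] *= 0.5; }
void step2(double* x, std::size_t n) { for (std::size_t i = 0; i < n; ++i) x[i] += 1.0; }

void process(std::vector<double>& data) {
  for (std::size_t off = 0; off < data.size(); off += kBatch) {
    const std::size_t n = std::min(kBatch, data.size() - off);
    step1(data.data() + off, n);  // the batch is now hot in L1
    step2(data.data() + off, n);  // reuse it before it gets evicted
  }
}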

answered Jan 21 '26 00:01 by Ahmed AEK