Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Alignment of Heap Arrays in C and C++ to Ease Compiler (GCC) Vectorization

I'm currently cooking up a wrapper container template class for std::vector that automatically creates a multi-resolution pyramid of the elements in its std::vector.

The key issue now is that I want the creation of the pyramid to be (GCC) auto-vectorizable.

All the data arrays stored internally in the std::vector and in my resolution pyramid are all create on the heap using standard new or allocator template argument. Is there someway I can help the compiler to force a specific alignment on my data so that the vectorization can operate on element (arrays) (blocks) with optimal alignment (typically 16).

I'm therefore using the custom allocator AlignmentAllocator but the GCC auto-vectorization message output still claims unaligned memory in std::mr_vector::construct_pyramid line 144 in multi_resolution.hpp containing the expression

for (size_t s = 1; s < snum; s++) { // for each cached scale
...
}

as follows

tests/../multi_resolution.hpp:144: note: Detected interleaving *D.3088_68 and MEM[(const value_type &)D.3087_61]
tests/../multi_resolution.hpp:144: note: versioning for alias required: can't determine dependence between *D.3088_68 and *D.3082_53
tests/../multi_resolution.hpp:144: note: mark for run-time aliasing test between *D.3088_68 and *D.3082_53
tests/../multi_resolution.hpp:144: note: versioning for alias required: can't determine dependence between MEM[(const value_type &)D.3087_61] and *D.3082_53
tests/../multi_resolution.hpp:144: note: mark for run-time aliasing test between MEM[(const value_type &)D.3087_61] and *D.3082_53
tests/../multi_resolution.hpp:144: note: found equal ranges MEM[(const value_type &)D.3087_61], *D.3082_53 and *D.3088_68, *D.3082_53
tests/../multi_resolution.hpp:144: note: Vectorizing an unaligned access.
tests/../multi_resolution.hpp:144: note: Vectorizing an unaligned access.
tests/../multi_resolution.hpp:144: note: vect_model_load_cost: strided group_size = 2 .
tests/../multi_resolution.hpp:144: note: vect_model_load_cost: unaligned supported by hardware.
tests/../multi_resolution.hpp:144: note: vect_model_load_cost: inside_cost = 4, outside_cost = 0 .
tests/../multi_resolution.hpp:144: note: vect_model_load_cost: unaligned supported by hardware.
tests/../multi_resolution.hpp:144: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
tests/../multi_resolution.hpp:144: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
tests/../multi_resolution.hpp:144: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 1 .
tests/../multi_resolution.hpp:144: note: vect_model_store_cost: unaligned supported by hardware.
tests/../multi_resolution.hpp:144: note: vect_model_store_cost: inside_cost = 2, outside_cost = 0 .
tests/../multi_resolution.hpp:144: note: cost model: Adding cost of checks for loop versioning aliasing.

tests/../multi_resolution.hpp:144: note: cost model: epilogue peel iters set to vf/2 because loop iterations are unknown .
tests/../multi_resolution.hpp:144: note: Cost model analysis: 
  Vector inside of loop cost: 10
  Vector outside of loop cost: 21
  Scalar iteration cost: 5
  Scalar outside cost: 1
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 7

tests/../multi_resolution.hpp:144: note:   Profitability threshold = 6

tests/../multi_resolution.hpp:144: note: Profitability threshold is 6 loop iterations.
tests/../multi_resolution.hpp:144: note: create runtime check for data references *D.3088_68 and *D.3082_53
tests/../multi_resolution.hpp:144: note: created 1 versioning for alias checks.

tests/../multi_resolution.hpp:144: note: LOOP VECTORIZED.

Can I somehow (strongly) type-specify the alignment of a pointer value coming from memalign so that GCC can be sure that the region pointed to by data() has the required alignment (in this case 16)?

/Per

Code for mr_vector template class in multi_resolution.hpp:

/*!
 * @file: multi_resolution.hpp
 * @brief: Multi-Resolution Containers.
 * @author: Copyright (C) 2011 Per Nordlöw ([email protected])
 * @date: 2011-06-29 12:22
 */

#pragma once

#include <vector>
#include <algorithm>
#include "bitwise.hpp"
#include "mean.hpp"
#include "allocators.hpp"
#include "ostream_x.hpp"

namespace std
{

/*! Multi-Resolution Vector with Allocator Alignment for each Level. */
//template<typename _Tp, typename _Alloc = std::allocator<_Tp> >
template<typename _Tp, std::size_t _Alignment = 16>
class mr_vector
{
    // Concept requirements.
    typedef AlignmentAllocator<_Tp, _Alignment> _Alloc;
    typedef typename _Alloc::value_type                _Alloc_value_type;
    __glibcxx_class_requires(_Tp, _SGIAssignableConcept)
    __glibcxx_class_requires2(_Tp, _Alloc_value_type, _SameTypeConcept)

    typedef _Vector_base<_Tp, _Alloc>            _Base;
    typedef typename _Base::_Tp_alloc_type       _Tp_alloc_type;
public:
    typedef _Tp                                      value_type;
    typedef typename _Tp_alloc_type::pointer         pointer;
    typedef typename _Tp_alloc_type::const_pointer   const_pointer;
    typedef typename _Tp_alloc_type::reference       reference;
    typedef typename _Tp_alloc_type::const_reference const_reference;
    typedef size_t                                   size_type;
    typedef ptrdiff_t                                difference_type;
    typedef _Alloc                                   allocator_type;

protected:
    // using _Base::_M_allocate;
    // using _Base::_M_deallocate;
    // using _Base::_M_impl;
    // using _Base::_M_get_Tp_allocator;

public:
    mr_vector(size_t n)
        : m_bot(n), m_datas(nullptr), m_sizes(nullptr) { construct_pyramid(); }
    mr_vector(size_t n, value_type value)
        : m_bot(n, value), m_datas(nullptr), m_sizes(nullptr) { construct_pyramid(); }
    mr_vector(const mr_vector & in)
        : m_bot(in.m_bot), m_datas(nullptr), m_sizes(nullptr) { construct_pyramid(); }

    mr_vector operator = (mr_vector & in) {
        if (this != &in) {
            delete_pyramid();
            m_bot = in.m_bot;
            construct_pyramid();
        }
    }

    ~mr_vector() { delete_pyramid(); }

    // Get Standard Scale Size.
    size_type size() const { return m_bot.size(); }
    // Get Normal Scale Data.
    value_type*       data() { return m_bot.data(); }
    const value_type* data() const { return m_bot.data(); }

    // Get Size at scale @p scale.
    size_type size(size_t scale) const { return m_sizes[scale]; }

    // Get Data at scale @p scale.
    value_type*       data(size_t scale) { return m_datas[scale]; }
    const value_type* data(size_t scale) const { return m_datas[scale]; }

    // Get Standard Element at index @p i.
    value_type& operator[](size_t i) { return m_bot[i]; }
    // Get Constant Standard Element at index @p i.
    const value_type& operator[](size_t i) const { return m_bot[i]; }

    // Get Constant Standard Element at scale @p scale at index @p i.
    value_type*       operator()(size_t scale, size_t i) { return m_datas[scale][i]; }
    const value_type* operator()(size_t scale, size_t i) const { return m_datas[scale][i]; }

    void resize(size_t n) {
        bool ch = (n != size());
        if (ch) { delete_pyramid(); }
        m_bot.resize(n);
        if (ch) { construct_pyramid(); }
    }

    void push_back(const _Tp & a) {
        delete_pyramid();
        m_bot.push_back(a);
        construct_pyramid();
    }
    void pop_back() {
        if (size()) { delete_pyramid(); }
        m_bot.pop_back();
        if (size()) { construct_pyramid(); }
    }
    void clear() {
        if (size()) { delete_pyramid(); }
        m_bot.clear();
    }

    /*! Print @p v to @p os. */
    friend std::ostream & operator << (std::ostream & os,
                                       const mr_vector & v)
    {
        for (size_t s = 0; s < v.scale_count(); s++) { // for each cached scale
            os << "scale:" << s << ' ';
            print_each(os, v.m_datas[s], v.m_datas[s]+v.m_sizes[s]);
            os << std::endl;
        }
        return os;
    }

protected:
    size_t scale_count(size_t sz) const { return pnw::binlog(sz)+1; } // one extra for bottom
    size_t scale_count() const { return scale_count(size()); }

    /// Construct Pyramid Bottom-Up starting at scale @p scale.
    void construct_pyramid() {
        if (not m_datas) {      // if no multi-scala yet
            const size_t snum = scale_count();
            if (snum >= 1) {
                m_datas = new value_type* [snum]; // allocate data pointers
                m_sizes = new size_type [snum];   // allocate lengths

                // first level is just copy
                m_datas[0] = m_bot.data();
                m_sizes[0] = m_bot.size();
            }
            for (size_t s = 1; s < snum; s++) { // for each cached scale
                auto sq = m_sizes[s-1] / 2;     // quotient
                auto sr = m_sizes[s-1] % 2;     // rest
                auto sn = m_sizes[s] = sq+sr;
                m_datas[s] = m_alloc.allocate(sn * sizeof(value_type*));
                for (size_t i = 0; i < sq; i++) { // for each dyadic reduction
                    m_datas[s][i] = pnw::arithmetic_mean(m_datas[s-1][2*i+0],
                                                         m_datas[s-1][2*i+1]);
                }
                if (sr) {       // if rest
                    m_datas[s][sq] = m_datas[s-1][2*sq+0] / 2; // extrapolate with zeros
                }
            }
        }
    }

    /// Delete Pyramid.
    void delete_pyramid() {
        if (m_datas) {        // if no multi-scala given yet1
            const size_t snum = scale_count();
            for (size_t s = 1; s < snum; s++) { // for each scale
                m_alloc.deallocate(m_datas[s], sizeof(value_type)); // clear level
            }
            delete[] m_datas; m_datas = nullptr; // deallocate scale pointers
            delete[] m_sizes; m_sizes = nullptr; // deallocate scale pointers
        }
    }

    /// Reconstruct Pyramid.
    void reconstruct_pyramid(size_t scale = 0) {
        delete_pyramid();
        construct_pyramid();
    }

private:
    std::vector<value_type, _Alloc> m_bot; ///< Bottom Resolutions.
    mutable value_type** m_datas; ///< Pyramid Resolutions Datas (Cache). Slaves under @c m_bot.
    mutable size_type* m_sizes; ///< Pyramid Resolution Lengths. Slaves under @c m_bot.
    _Alloc m_alloc;
};

}

and code for custom allocator AlignmentAllocator in allocators.hpp follows:

/*!
 * @file: allocators.hpp
 * @brief: Custom Allocators.
 * @author: Copyright (C) 2009 Per Nordlöw ([email protected])
 * @date: 2009-01-12 16:42
 * @see http://ompf.org/forum/viewtopic.php?f=11&t=686
 * On Windows use @c _aligned_malloc_() and @c _aligned_free_().
 */

#pragma once

#include <cstdlib>              // @c size_t
#if defined (__WIN32__) && ! defined (_POSIX_VERSION) // Windows
#  include <malloc.h>           // @c memalign()
#elif defined (__GNUC__)        // GNU
#  include <malloc.h>           // @c memalign()
#else                           // Rest
#endif

/*!
 * Allocator with Specific @em Alignment.
 */
template <typename _Tp, std::size_t N = 16>
class AlignmentAllocator
{
public:
    typedef _Tp value_type;
    typedef std::size_t size_type;
    typedef std::ptrdiff_t difference_type;

    typedef _Tp * pointer;
    typedef const _Tp * const_pointer;

    typedef _Tp & reference;
    typedef const _Tp & const_reference;

public:
    inline AlignmentAllocator () throw () { }

    template <typename T2>
    inline AlignmentAllocator (const AlignmentAllocator<T2, N> &) throw () { }

    inline ~AlignmentAllocator () throw () { }

    inline pointer adress (reference r) { return &r; }

    inline const_pointer adress (const_reference r) const { return &r;
    }

    inline pointer allocate (size_type n)
    {
#if defined (__WIN32__) && ! defined (_POSIX_VERSION) // Windows
        return (pointer)memalign(N, n*sizeof(value_type));
#elif defined (__GNUC__)        // GNU
        return (pointer)memalign(N, n*sizeof(value_type));
#else  // Rest
        return (pointer)_mm_malloc (n*sizeof(value_type), N);
#endif
    }

    inline void deallocate (pointer p, size_type)
    {
#if defined (__WIN32__) && ! defined (_POSIX_VERSION) // Window
        return free(p);
#elif defined (__GNUC__)        // GNU
        return free(p);
#else  // Rest
        _mm_free (p);
#endif
    }

    inline void construct (pointer p, const value_type & wert) { new (p) value_type (wert); }

    inline void destroy (pointer p) { p->~value_type (); }

    inline size_type max_size () const throw () { return size_type (-1) / sizeof (value_type);     }

    template <typename T2>
    struct rebind { typedef AlignmentAllocator<T2, N> other; };
};
like image 335
Nordlöw Avatar asked Jul 06 '11 14:07

Nordlöw


3 Answers

Since you're using vectorization, I assume this is an optimization and that these are large arrays. In that case, why not use VirtualAlloc and get your arrays in multiples of 64k guaranteed to be aligned on a 64k boundary? Example:

template<class T> T* getBigAlignedArray(unsigned count) {
    return ((T*) VirtualAlloc(NULL, sizeof(T)*count, (MEM_RESERVE | MEM_COMMIT), PAGE_READWRITE));
};
template<class T> void freeBigAlignedArray(T* pThing) {
    VirtualFree((LPVOID) pThing, 0, MEM_RELEASE);
};

seems a little more transparent to me.

like image 167
sqykly Avatar answered Nov 19 '22 13:11

sqykly


Could your answer be C++11 scoped_allocator?

This allows you to pass a stateful allocator to the elements as well as a vector. Use the same custom allocator for m_bot, m_datas, m_sizes, and for value_type.

Or maybe I'm nuts and value_type doesn't get/need an allocator.

like image 43
emsr Avatar answered Nov 19 '22 12:11

emsr


Maybe you should define your own allocator to replace the default allocator so you can control the whole memory layout on your own.

like image 1
yangrenyong Avatar answered Nov 19 '22 13:11

yangrenyong