Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the approach to implement operator[] / iterator to deal with performance issue?

Tags:

c++

Let's say that I have a custom container MyList like following.

template<class T>
class MyList
{
public:
    const T& get (size_t index) const
    {
        Block& block = getBlock (index);
        return block.get (index - block.startIndex);
    }
    void set (size_t index, const T& value)
    {
        Block& block = getBlock (index);
        block.dirty = true;
        block.set (index - block.startIndex, value);
    }
    // ...
};

Block is basically a cached block of data loaded from disk (there can be 2-3 such LRU blocks). If a block is dirty, when it is not memory, then its data would be written to disk.

My questions are the following.

  1. Overloading operator[] does not get the correct operator[] used. I overloaded the following two:

     T& operator[](size_t index)
     {
         Block& block = getBlock (index);
         block.dirty = true;
         return block.get (index - block.startIndex);
     }
     const T& operator[](size_t index) const
     {
         return get (index);
     }
    

However, it appears that g++ calls the first one if the MyList is not a const version. That is, for a code like the following:

MyList<int> list;
// ...
int myValue = list[0];

The operator[] called is always the first one which would set the dirty bit. This is not what I want.

Did I do something wrong? While I could certainly call get() or use const_iterator, my concern is that the user of the MyList would use operator[] in undesired situations. Is there a better way of operator overloading that does not have this issue?

  1. What about iterators? I had the similar issue for operator* in iterators. That is the first one is always called if the iterator itself is not const.
T& operator* ();
const T& operator* () const;
  1. My goal is to run C++ STL algorithms such as sort / nth_element / etc on this container. If we consider the search phase and swap phase of C++ STL algorithms, it appears to me that there is a potential for faster sorting if the correct iterator is used, if data is mostly sorted already.
MyList<int> list;
// ...
std::sort (list.begin (), list.end ());

Is there a way to minimize the number of blocks with dirty bit being set if the data is not changed?

Can you provide any pointers for what to look at / follow / learn? Or is this a limitation of the language or compiler?

like image 742
user1456982 Avatar asked Jan 13 '21 21:01

user1456982


People also ask

How is iterator implemented in C++?

An iterator is an object that points to an element inside a container. Like a pointer, an iterator can be used to access the element it points to and can be moved through the content of the container. Each container in the C++ Standard Library provides its own iterator, as well as some methods to retrieve it.

Why do we use iterator in c++?

An iterator is used to point to the memory address of the STL container classes. For better understanding, you can relate them with a pointer, to some extent. Iterators act as a bridge that connects algorithms to STL containers and allows the modifications of the data present inside the container.

What is c++ iterator?

An iterator is an object that can iterate over elements in a C++ Standard Library container and provide access to individual elements.

Are iterators slow?

The iterator loop is the slowest, and the difference between for loop and while loop isn't that significant.


2 Answers

The operator is called as per the standard.

To understand why: (with very few exceptions) all (sub)expressions in C++ evaluate exactly the same regardless of the enclosing (full)expression.

This means that the call list[0] is exactly the same in: int a = list[0] and list[0] = 11. So when choosing the overload for list[0] the surrounding expressions are completely ignored. The only things that are taken into account are the type of list and the type of 0. Since list is non-const and a non-const overload with a perfect match is provided, the non-const overload is chosen.

So the problem in your case is that when you call a method on a non-const object there is no way to know if that will further be used to read or write.

Thankfully there are a few solutions to this problem:

Return a proxy

Instead of returning a reference to the element, return a proxy object. Set the dirty bit only when trying to write through that proxy object. This same strategy can be extended to work with iterators.

Provide a const view

Provide something like std::span:

std::span<const T> as_const_view() const;

This will implicitly enable the iterators you want. The downside is that the user needs to be aware of this quirk of your class and avoid calling operator[] for read access.

Don't provide these kinds of overloads

Stick to get and set and avoid confusion.

like image 199
bolov Avatar answered Nov 15 '22 04:11

bolov


Block is basically a cached block of data loaded from disk (there can be 2-3 such LRU blocks).

Assuming that you're using an operating system: The operating system already does all this for you. You already pay the cost of doing a variant of LRU replacement, buffer space management, all that - it's done. So you shouldn't be doing any of it yet again - what's the point? If anything, to help with prefetch, you should tell the OS what file location(s) you're anticipating to use next - if you have this information due to how your list is used. Other than that, just map the file to memory and synchronize it to disk forcibly if you think you need that for data integrity reasons. On POSIX/Linux platforms, you'd use msync for that, and there's a Windows equivalent as well.

The operator[] called is always the first one which would set the dirty bit. This is not what I want.

The correct operator is called per C++ spec, so you do in fact "want" it even if you don't know that you do. If you think a bit about what purpose the const method overload resolution serves, you'll realize that doing it any other way would irreparably break the language.

If anything, you shouldn't be setting the dirty bit until something in that block actually changes, i.e. the block should be offering getter and setter methods to change its contents. An alternative, in your existing approach, is to allocate the memory for the block in a individually mapped memory page(s), load the block from disk, then set the pages write-protected. Genuine write access will trigger a signal (on Unix) or a similar mechanism on Windows, and you can then unprotect the page, set its dirty bit, and resume the program.

But that's utterly moot once you memory-map the file these blocks come from: the operating system does all the dirty-bit generation for you!

like image 34
Kuba hasn't forgotten Monica Avatar answered Nov 15 '22 03:11

Kuba hasn't forgotten Monica