I'm trying to split a massive <code>QByteArray</code> which contains UTF-8 encoded plain text(using whitespace as delimiter) with the best performance possible. I found that I can achieve much better results if I convert the array to <code>QString</code> first. I tried using the <code>QString.split</code> function using a regexp, but the performance was horrendous. This code turned out to be way faster: <pre class="prettyprint"><code>QMutex mutex; QSet<QString> split(QByteArray body) { QSet<QString> slova; QString s_body = QTextCodec::codecForMib(106)->toUnicode(body); QString current; for(int i = 0; i< body.size(); i++){ if(s_body[i] == '\r' || s_body[i] == '\n' || s_body[i] == '\t' || s_body[i] == ' '){ mutex.lock(); slova.insert(current); mutex.unlock(); current.clear(); current.reserve(40); } else { current.push_back(s_body[i]); } } return slova; } </code></pre> "Slova" is a <code>QSet<QString></code> currently, but I could use a <code>std::set</code> or any other format. This code is supposed to find how many unique words there are in the array, with the best performance possible. Unfortunately, this code runs far from fast enough. I'm looking to squeeze the absolute maximum out of this. Using callgrind, I found that the most gluttonous internal functions were: <pre class="prettyprint"><code>QString::reallocData (18% absolute cost) QString::append (10% absolute cost) QString::operator= (8 % absolute cost) QTextCodec::toUnicode (8% absolute cost) </code></pre> Obviously, this has to do with memory allocation stemming from the <code>push_back</code> function. What is the most optimal way to solve this? Doesn't necessarily have to be a Qt solution - pure C or C++ are also acceptable.

Minimise the amount of copying you need to do. Keep the input buffer in UTF-8, and don't store <code>std::string</code> or <code>QString</code> in your set; instead, create a small class to reference the existing UTF-8 data: <pre class="prettyprint"><code>#include <QString> class stringref { const char *start; size_t length; public: stringref(const char *start, const char *end); operator QString() const; bool operator<(const stringref& other) const; }; </code></pre> This can encapsulate a substring of the UTF-8 input. You'll need to ensure that it doesn't outlive the input string; you could do this by clever use of <code>std::shared_ptr</code>, but if the code is reasonably self-contained, then it should be tractable enough to reason about the lifetime. We can construct it from a pair of pointers into our UTF-8 data, and convert it to <code>QString</code> when we want to actually use it: <pre class="prettyprint"><code>stringref::stringref(const char *start, const char *end) : start(start), length(end-start) {} stringref::operator QString() const { return QString::fromUtf8(start, length); } </code></pre> You need to define <code>operator<</code> so you can use it in a <code>std::set</code>. <pre class="prettyprint"><code>#include <cstring> bool stringref::operator<(const stringref& other) const { return length == other.length ? std::strncmp(start, other.start, length) < 0 : length < other.length; } </code></pre> Note that we sort by length before dereferencing pointers, to reduce cache impact. <hr> Now we can write the <code>split</code> method: <pre class="prettyprint"><code>#include <set> #include <QByteArray> std::set<stringref> split(const QByteArray& a) { std::set<stringref> words; // start and end const auto s = a.data(), e = s + a.length(); // current word auto w = s; for (auto p = s; p <= e; ++p) { switch (*p) { default: break; case ' ': case '\r': case '\n': case '\t': case '\0': if (w != p) words.insert({w, p}); w = p+1; } } return words; } </code></pre> The algorithm is pretty much yours, with the addition of the <code>w!=p</code> test so that runs of whitespace don't get counted. <hr> Let's test it, and time the important bit: <pre class="prettyprint"><code>#include <QDebug> #include <chrono> int main() { QByteArray body{"foo bar baz\n foo again\nbar again "}; // make it a million times longer for (int i = 0; i < 20; ++i) body.append(body); using namespace std::chrono; const auto start = high_resolution_clock::now(); auto words = split(body); const auto end = high_resolution_clock::now(); qDebug() << "Split" << body.length() << "bytes in" << duration_cast<duration<double>>(end - start).count() << "seconds"; for (auto&& word: words) qDebug() << word; } </code></pre> I get: <blockquote> Split 35651584 bytes in 1.99142 seconds "bar" "baz" "foo" "again" </blockquote> Compiling with <code>-O3</code> reduced that time to 0.6188 seconds, so don't forget to beg the compiler for help! If that's still not fast enough, it's probably time to start to look at parallelising the task. You'll want to split the string into roughly equal lengths, but advance to the next whitespace so that no work straddles two threads worth of work. Each thread should create its own set of results, and the reduction step is then to merge the result sets. I won't provide a full solution for this, as that's another question in its own right.

C++, Qt - splitting a QByteArray as fast as possible

Tags:

c++

performance

split

qt

qbytearray

I'm trying to split a massive QByteArray which contains UTF-8 encoded plain text(using whitespace as delimiter) with the best performance possible. I found that I can achieve much better results if I convert the array to QString first. I tried using the QString.split function using a regexp, but the performance was horrendous. This code turned out to be way faster:

QMutex mutex;
QSet<QString> split(QByteArray body)
{
    QSet<QString>  slova;

    QString s_body = QTextCodec::codecForMib(106)->toUnicode(body);
    QString current;

    for(int i = 0; i< body.size(); i++){
        if(s_body[i] == '\r' || s_body[i] == '\n' || s_body[i] == '\t' || s_body[i] == ' '){

            mutex.lock();
            slova.insert(current);

            mutex.unlock();
            current.clear();
            current.reserve(40);
        } else {
            current.push_back(s_body[i]);
        }
    }
    return slova;
}

"Slova" is a QSet<QString> currently, but I could use a std::set or any other format. This code is supposed to find how many unique words there are in the array, with the best performance possible.

Unfortunately, this code runs far from fast enough. I'm looking to squeeze the absolute maximum out of this.

Using callgrind, I found that the most gluttonous internal functions were:

QString::reallocData (18% absolute cost)
QString::append (10% absolute cost)
QString::operator= (8 % absolute cost)
QTextCodec::toUnicode (8% absolute cost)

Obviously, this has to do with memory allocation stemming from the push_back function. What is the most optimal way to solve this? Doesn't necessarily have to be a Qt solution - pure C or C++ are also acceptable.

826

asked Apr 28 '16 21:04

user129186

2 Answers

Minimise the amount of copying you need to do. Keep the input buffer in UTF-8, and don't store std::string or QString in your set; instead, create a small class to reference the existing UTF-8 data:

#include <QString>

class stringref {
    const char *start;
    size_t length;

public:
    stringref(const char *start, const char *end);
    operator QString() const;
    bool operator<(const stringref& other) const;
};

This can encapsulate a substring of the UTF-8 input. You'll need to ensure that it doesn't outlive the input string; you could do this by clever use of std::shared_ptr, but if the code is reasonably self-contained, then it should be tractable enough to reason about the lifetime.

We can construct it from a pair of pointers into our UTF-8 data, and convert it to QString when we want to actually use it:

stringref::stringref(const char *start, const char *end)
    : start(start), length(end-start)
{}

stringref::operator QString() const
{
    return QString::fromUtf8(start, length);
}

You need to define operator< so you can use it in a std::set.

#include <cstring>
bool stringref::operator<(const stringref& other) const
{
    return length == other.length
        ? std::strncmp(start, other.start, length) < 0
        : length < other.length;
}

Note that we sort by length before dereferencing pointers, to reduce cache impact.

Now we can write the split method:

#include <set>
#include <QByteArray>
std::set<stringref> split(const QByteArray& a)
{
    std::set<stringref> words;

    // start and end
    const auto s = a.data(), e = s + a.length();

    // current word
    auto w = s;

    for (auto p = s;  p <= e;  ++p) {
        switch (*p) {
        default: break;
        case ' ': case '\r': case '\n': case '\t': case '\0':
            if (w != p)
                words.insert({w, p});
            w = p+1;
        }
    }

    return words;
}

The algorithm is pretty much yours, with the addition of the w!=p test so that runs of whitespace don't get counted.

Let's test it, and time the important bit:

#include <QDebug>
#include <chrono>
int main()
{
    QByteArray body{"foo bar baz\n  foo again\nbar again "};
    // make it a million times longer
    for (int i = 0;  i < 20;  ++i)
        body.append(body);

    using namespace std::chrono;
    const auto start = high_resolution_clock::now();

    auto words = split(body);

    const auto end = high_resolution_clock::now();
    qDebug() << "Split"
             << body.length()
             << "bytes in"
             << duration_cast<duration<double>>(end - start).count()
             << "seconds";

    for (auto&& word: words)
        qDebug() << word;
}

I get:

Split 35651584 bytes in 1.99142 seconds
"bar"
"baz"
"foo"
"again"

Compiling with -O3 reduced that time to 0.6188 seconds, so don't forget to beg the compiler for help!

If that's still not fast enough, it's probably time to start to look at parallelising the task. You'll want to split the string into roughly equal lengths, but advance to the next whitespace so that no work straddles two threads worth of work. Each thread should create its own set of results, and the reduction step is then to merge the result sets. I won't provide a full solution for this, as that's another question in its own right.

115

answered Sep 16 '22 22:09

Toby Speight

Your largest cost, as suspected, is in push_back causing frequent reallocations as you append one character at a time. Why not search ahead, then append all of the data at once using QString::mid():

slova.insert(s_body.mid(beginPos, i - beginPos - 1));

Where beginPos holds the index of the start of the current substring. Instead of appending each character to current before it is inserted into slova, the copy happens all at once. After copying a substring, search ahead for the next valid (not a separator) character and set beginPos equal to that index.

In (rough) code:

QString s_body = ...
//beginPos tells us the index of the current substring we are working 
//with. -1 means the previous character was a separator
int beginPos = -1;
for (...) {
    //basically your if statement provided in the question as a function
    if (isSeparator(s_body[i])) {
         //ignore double white spaces, etc.
         if (beginPos != -1) {
             mutex.lock();
             slova.insert(s_body.mid(beginPos, i - beginPos - 1));
             mutex.unlock();
         }
    } else if (beginPos == -1)
        //if beginPos is not valid and we are not on a separator, we 
        //are at the start of a new substring.
         beginPos = i;
}

This approach will drastically reduce your overhead in heap allocations and eliminate QString::push_back() calls.

One final note: QByteArray also provides a mid() function. You can skip the conversion to QString entirely and work directly with the byte array.

answered Sep 18 '22 22:09

jonspaceharper

Related questions
                            
                                C++ Pointer within parentheses
                            
                                Why is this "call" to assert seeing two arguments instead of just the one?
                            
                                Extend call stack to disk in C++?
                            
                                Difference between std::logical_not and std::not1?
                            
                                How to install METIS on ubuntu
                            
                                Can C++ enable_if have a default implementation?
                            
                                How can I use `std::array` for a template parameter of the form `template<typename> class`?
                            
                                Will an exception thrown from a noexcept function parameter's constructor immediately result in a call to std::terminate()?
                            
                                What is the difference between initializing with = and initializing with {}?
                            
                                Why does cout.setf(ios::fixed) change my floats to hexadecimal?
                            
                                How to dock a pop-out window?
                            
                                How does an Antivirus knows not to detect non malicious code?
                            
                                pcl::PCLPointCloud2 usage
                            
                                Visual leak detector (VLD) displays empty call stack
                            
                                Inheritence and copy constructor - how to init private fields from base class?
                            
                                Access member of derived class from pointer of base class
                            
                                c++ reading file is too slow
                            
                                Better pattern for partial specialization disambiguation precedence chain?
                            
                                Is it safe to return from function before all std::futures are finished?
                            
                                With C++ lambdas, what are the rules for capturing references by reference?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With