Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the efficient way of sorting columnar data

In my program, I need an efficient way of sorting in-memory columnar data.

Let me explain this problem. The data consists of four objects:

(1, 15, ‘apple’), (9, 27, ‘pear’),(7, 38, ‘banana’),(4, 99, ‘orange’)

And the four objects are kept in memory in a columnar, and looks like this:

[1,9,7,4],[15, 27, 38, 99],[‘apple’,’pear’,’banana’,’orange’]

I need to sort this list according to the second column in ascending order and the third in descending order.

In the case of only one columnar, it is a simple sorting problem.

But the case is different when two or more columns exist for in-memory columnar data. The swap function may incur too many overheads during sorting columnar data. I’ve checked several Open-sourced implementations to find the best practice, e.g., Apache Arrow, Presto, TDengine, and other projects. And I found that using index sort is a way that can avoid the overhead introduced by swapping since the index, instead of columnar data, will be swapped.

And I’m wondering is the index sort the most efficient means to handle this problem?

like image 366
Roony Avatar asked Sep 17 '25 20:09

Roony


1 Answers

If you want speed then fastest language is C++.

You can use std::sort for sorting purposes with parallel execution policy std::execution::par_unseq (which enables multi-threaded multi-core parallel sort).

As you can notice in my code below, I did arg-sort, because you wished for it. But in C++ it is not really necessary, it is enough to do regular sort, for two reasons.

One is because of cache locality, swapping data elements themselves instead of indices is faster because sorting algorithms are often more or less cache friendly, meaning that swapping/comparing nearby-in-memory elements happens often.

Second is because swapping of elements in std::sort is done through std::swap, which in turn uses std::move, this move function swaps classes like std::string very efficiently by just swapping pointers to data instead of copying data itself.

From above it follows that doing arg-sort instead of regular sort might be even slower.

Following code snippet first creates small example file with several data tuples that you provided. In real case you should remove this file writing code so that you don't overwrite your file. At the end it outputs to new file.

After program finishes see created files data.in and data.out.

Try it online!

#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <tuple>
#include <execution>
#include <algorithm>

int main() {
    {
        std::ofstream f("data.in");
        f << R"(
            7, 38, banana
            9, 27, pear
            4, 99, orange
            1, 15, apple
        )";
    }
    std::vector<std::tuple<int, int, std::string>> data;
    {
        std::ifstream f("data.in");
        std::string c;
        while (true) {
            int a = 0, b = 0;
            char comma = 0;
            c.clear();
            f >> a >> comma >> b >> comma >> c;
            if (c.empty() && !f)
                break;
            data.push_back({a, b, c});
        }
    }
    std::vector<size_t> idx(data.size());
    for (size_t i = 0; i < idx.size(); ++i)
        idx[i] = i;
    std::sort(std::execution::par_unseq, idx.begin(), idx.end(),
        [&data](auto const & i, auto const & j){
            auto const & [_0, x0, y0] = data[i];
            auto const & [_1, x1, y1] = data[j];
            if (x0 < x1)
                return true;
            else if (x0 == x1)
                return y0 > y1;
            else
                return false;
        });
    {
        std::ofstream f("data.out");
        for (size_t i = 0; i < idx.size(); ++i) {
            auto const & [x, y, z] = data[idx[i]];
            f << x << ", " << y << ", " << z << std::endl;
        }
    }
}

Input:

7, 38, banana
9, 27, pear
4, 99, orange
1, 15, apple

Output:

1, 15, apple
9, 27, pear
7, 38, banana
4, 99, orange
like image 90
Arty Avatar answered Sep 21 '25 04:09

Arty