Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Idiomatically split a string_view

I read The most elegant way to iterate the words of a string and enjoyed the succinctness of the answer. Now I want to do the same for string_view. Problem is, stringstream can't take a string_view:

#include <iostream>
#include <string>
#include <sstream>
#include <algorithm>
#include <iterator>

int main() {
    using namespace std;
    string_view sentence = "And I feel fine...";
    istringstream iss(sentence); // <== error
    copy(istream_iterator<string_view>(iss),
         istream_iterator<string_view>(),
         ostream_iterator<string_view>(cout, "\n"));
}

So is there a way to do this? If not, what is the reasoning such a thing would be not idiomatic?

like image 496
user9150413 Avatar asked Mar 08 '23 06:03

user9150413


1 Answers

Split by a delimiter and return a vector<string_view>.

Designed for rapid splitting of lines in a .csv file.

Tested under MSVC 2017 v15.9.6 and Intel Compiler v19.0 compiled with C++17 (which is required for string_view).

#include <string_view>

std::vector<std::string_view> Split(const std::string_view str, const char delim = ',')
{   
    std::vector<std::string_view> result;

    int indexCommaToLeftOfColumn = 0;
    int indexCommaToRightOfColumn = -1;

    for (int i=0;i<static_cast<int>(str.size());i++)
    {
        if (str[i] == delim)
        {
            indexCommaToLeftOfColumn = indexCommaToRightOfColumn;
            indexCommaToRightOfColumn = i;
            int index = indexCommaToLeftOfColumn + 1;
            int length = indexCommaToRightOfColumn - index;

            // Bounds checking can be omitted as logically, this code can never be invoked 
            // Try it: put a breakpoint here and run the unit tests.
            /*if (index + length >= static_cast<int>(str.size()))
            {
                length--;
            }               
            if (length < 0)
            {
                length = 0;
            }*/

            std::string_view column(str.data() + index, length);
            result.push_back(column);
        }
    }
    const std::string_view finalColumn(str.data() + indexCommaToRightOfColumn + 1, str.size() - indexCommaToRightOfColumn - 1);
    result.push_back(finalColumn);
    return result;
}

Be careful of lifetimes: a string_view should never outlive the parent string that it is a window into. If the parent string goes out of scope, then what the string_view points to is is invalid. In this particular case, the API design makes it difficult to go wrong as it the input/output is all string_view which are all windows into the parent string. This ends up being rather efficient in terms of memory copying and CPU usage.

Note that if using string_view the only downside is losing implicit null termination. So use functions that support string_view, e.g. the lexical_cast functions in Boost for converting strings to numbers.

I used this to rapidly parse a .csv file. To get each new line in the .csv file, I used istringstream and getLine() which is blazinly fast (~2GB/second or 1,200,000 lines per second on a single core).

Unit tests. Use Google Test for testing (I installed using vcpkg).

// Google Test integrates into VS2017 if ReSharper is installed. 
#include "gtest/gtest.h" // Can install using vcpkg
// In main(), call:   
// ::testing::InitGoogleTest(&argc, argv);return RUN_ALL_TESTS();

TEST(Strings, Split)
{
    {
        const std::string str = "A,B,C";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 3);
        EXPECT_TRUE(tokens[0] == "A");
        EXPECT_TRUE(tokens[1] == "B");
        EXPECT_TRUE(tokens[2] == "C");
    }       
    {
        const std::string str = ",B,C";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 3);
        EXPECT_TRUE(tokens[0] == "");
        EXPECT_TRUE(tokens[1] == "B");
        EXPECT_TRUE(tokens[2] == "C");
    }
    {
        const std::string str = "A,B,";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 3);
        EXPECT_TRUE(tokens[0] == "A");
        EXPECT_TRUE(tokens[1] == "B");
        EXPECT_TRUE(tokens[2] == "");
    }
    {
        const std::string str = "";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 1);
        EXPECT_TRUE(tokens[0] == "");
    }
    {
        const std::string str =  "A";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 1);
        EXPECT_TRUE(tokens[0] == "A");
    }
    {
        const std::string str =  ",";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 2);
        EXPECT_TRUE(tokens[0] == "");
        EXPECT_TRUE(tokens[1] == "");
    }
    {
        const std::string str =  ",,";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 3);
        EXPECT_TRUE(tokens[0] == "");
        EXPECT_TRUE(tokens[1] == "");
        EXPECT_TRUE(tokens[2] == "");
    }
    {
        const std::string str = "A,";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 2);
        EXPECT_TRUE(tokens[0] == "A");
        EXPECT_TRUE(tokens[1] == "");
    }
    {
        const std::string str = ",B";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 2);
        EXPECT_TRUE(tokens[0] == "");
        EXPECT_TRUE(tokens[1] == "B");
    }       
}
like image 146
Contango Avatar answered Mar 09 '23 21:03

Contango