Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I concisely find all digits in a string without using a loop?

Tags:

c++

c++11

c++14

I want to get all digits in a std::string but without using a loop (myself; what the code I'm calling uses, I don't mind). An alternative view of the request is: remove all non-digits from the string, leaving only the digits. I know that I can find all digit in a string using code like this:

std::string get_digits(std::string input) {
    std::string::size_type next_digit(0u);
    for (std::string::size_type pos(0u);
         input.npos != (pos = input.find_first_of("0123456789"));
         ++pos) {
        input[next_digit++] = input[pos];
    }
    input.resize(next_digit);
    return input;
}

However, this function uses a loop. std::string doesn't provide a function find_all() or something! Ideally, the string is maniulated in-place (the code above moves it but it is easily changed to take a reference).

When there are multiple alternatives, I'll promise to post profiling results of how good the different approaches work on some lengthy text.

like image 610
Dietmar Kühl Avatar asked Dec 01 '22 00:12

Dietmar Kühl


2 Answers

One way would be to use std::copy_if (or std::remove_if):

std::string get_digits(std::string input) {
    std::string result;
    std::copy_if(
        input.begin(), 
        input.end(), 
        std::back_inserter(result), 
        [](char c) { return '0' <= c && c <= '9'; });
    return result;
}

Obviously this uses a loop internally, but you said you don't care about that...

Edit: With std::remove_if:

std::string get_digits_remove(std::string input) {
    auto itErase = std::remove_if(
        input.begin(), 
        input.end(), 
        [](char c) { return !('0' <= c && c <= '9'); });
    input.erase(itErase, input.end());
    return input;
}
like image 88
wakjah Avatar answered Dec 05 '22 11:12

wakjah


Although I primarily had hoped for 5 quick answers (which wasn't achieved, sigh) the answers and comments led to some interesting approaches I hadn't thought of myself. My personal expectation had been that the answers effectively would result in:

  • If you want to be fast, use

    input.erase(std::remove_if(input.begin(), input.end(),
                               [](unsigned char c){ return !std::isdigit(c); }),
                input.end());
    
  • If you want to be concise, use

    text = std::regex_replace(text, std::regex(R"(\D)"), "");
    

Instead, there were a number of approaches I hadn't even considered:

  • Use a recursive function!

  • Use std::partition() which seems to require extra work (retain the characters which will be thrown out) and changes the order.

  • Use std::stable_partition() which seems to require even more work but doesn't change the order.

  • Use std::sort() and extract the substring with the relevant characters although I don't know how to make that one retain the original sequence of character. Just using a stable version doesn't quite to it.

Putting the different approaches together and using a number of variations on how to classify the characters, led to a total of 17 version of roughly the same operation (the code is on github). Most of the versions use std::remove_if() and std::string::erase() but differ in the classification of digits.

  1. remove_if() with [](char c){ return d.find(c) == d.npos; }).
  2. remove_if() with [](char c){ return std::find(d.begin(), d.end(), c) == d.end(); }
  3. remove_if() with [](char c){ return !std::binary_search(d.begin(), d.end()); }
  4. remove_if() with [](char c){ return '0' <= c && c <= '9'; }
  5. remove_if() with [](unsigned char c){ return !std::isdigit(c); } (the char is passed as unsigned char to avoid undefined behavior in case c is a char with a negative value)
  6. remove_if() with std::not1(std::ptr_fun(std::static_cast<int(*)(int)>(&std::isdigit))) (the cast is necessary to determine the correct overload: std::isdigit() happens to be overloaded).
  7. remove_if() with [&](char c){ return !hash.count(c); }
  8. remove_if() with [&](char c){ return filter[c]; } (the code initializing actually uses a loop)
  9. remove_if() with [&](char c){ return std::isidigit(c, locale); }
  10. remove_if() with [&](char c){ return ctype.is(std::ctype_base::digit, c); }
  11. str.erase(std::parition(str.begin(), str.end(), [](unsigned char c){ return !std::isdigit(c); }), str.end())
  12. str.erase(std::stable_parition(str.begin(), str.end(), [](unsigned char c){ return !std::isdigit(c); }), str.end())
  13. the "sort-approach" describe in one of the answers
  14. the copy_if() approach described in one of the answers
  15. the recursive approach describe in one of the answers
  16. text = std::regex_replace(text, std::regex(R"(\D)"), ""); (I didn't manage to get this to work on icc)
  17. like 16 but with an already built regular expression

I have run the benchmark on a MacOS notebook. Since results like this are reasonably easy to graph with Google Chars, here is a graph of the results (although with the versions using regexps removed as these would cause the graph to scale such that the interesting bit isn't really visible). The results of the benchmarks in form of a table:

    test                          clang   gcc     icc
 1  use_remove_if_str_find        22525   26846  24815
 2  use_remove_if_find            31787   23498  25379
 3  use_remove_if_binary_search   26709   27507  37016
 4  use_remove_if_compare          2375    2263   1847
 5  use_remove_if_ctype            1956    2209   2218
 6  use_remove_if_ctype_ptr_fun    1895    2304   2236
 7  use_remove_if_hash            79775   60554  81363
 8  use_remove_if_table            1967    2319   2769
 9  use_remove_if_locale_naive    17884   61096  21301
10  use_remove_if_locale           2801    5184   2776
11  use_partition                  1987    2260   2183
12  use_stable_partition           7134    4085  13094
13  use_sort                      59906  100581  67072
14  use_copy_if                    3615    2845   3654
15  use_recursive                  2524    2482   2560
16  regex_build                  758951  531641 
17  regex_prebuild               775450  519263
like image 27
Dietmar Kühl Avatar answered Dec 05 '22 09:12

Dietmar Kühl