Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

better way to ensure that a string contains only alphanumeric characters?

Tags:

c++

I was looking for a trivial way to check if a type std::string contains only alphanumeric characters. My code looks like the following:

   std::string sStr("This is a test");
   for (std::string::const_iterator s = sStr.begin(); s != sStr.end(); ++s)
              if (! isalnum(*s)) return false;

Is this the right way of doing this? Is there a more efficient way of handling the problem?

like image 852
randombits Avatar asked Dec 10 '22 17:12

randombits


2 Answers

I would use std::find_if:

std::string sStr("This is a test");

auto it = std::find_if(std::begin(sStr), std::end(sStr), [](unsigned char c) {
    return !std::isalnum(c);
});

return it == std::end(sStr);

std::find_if_not maybe would make more sense:

auto it = std::find_if_not(std::begin(sStr), std::end(sStr), [](unsigned char c) {
    return std::isalnum(c);
});
like image 181
Rakete1111 Avatar answered Jan 18 '23 23:01

Rakete1111


Yes, actually*. The looping constructs found in <algorithm> appear to perform slightly better than a raw loop under optimization, at least with gcc.

Using <algorithm> and a lambda is a nice way of doing this, at first glance:

bool onlyAlnum(const std::string& str){
    return std::all_of(
        str.begin(), 
        str.end(), 
        [loc = std::locale{}](char c){return std::isalnum(c, loc);});
}

However this has its disadvantages.

locale: the locale version of isalnum seemed to be quite a bit slower than the <cctype> incarnation of the function when I went to test it. the cctype version doesn't pay attention to locale, but testing single char for whether they're alphanumeric is going to work for a tiny, tiny subset of UTF-8 characters anyway: UTF-8 is a variable-width encoding, and testing a part of a multi-char character will result in a false negative for a test of alphanumerism.

The lambda above is a c++14 lambda, which initializes a variable loc to hold the locale when it's first created. This allows the function to work dependent on the present locale, while also preventing the cost of constructing a new object to represent the locale every time the predicate is evaluated, as would happen with a lambda like:

[](char c){return std::isalnum(c, std::locale{});}

However, it's still a very slow test, in comparison. If we don't need the limited advantages of the <locale> incarnation of std::isalnum, we can use the (much) faster <cctype> version:

[](char c){return std::isalnum(c);}

So we come to a second way of doing this, which uses the cctype version instead. Testing shows this to be much faster, on par with the raw loop you gave:

bool onlyAlnumAllOf(const std::string& str){
    return std::all_of(
        str.begin(), 
        str.end(), 
        [](char c){return std::isalnum(c);});
}

all_of tests whether a condition is valid for every entry in a range of input iterators. The range is supplied by the first two arguments, here str.begin() and str.end(), which naturally define the start and end of the string.

and a demo on coliru shows that onlyAlNum will return true for any string containing only characters from the alphabet or numbers, though not anything containing a space.

In the end, you can test the difference. With a cursory test evaluating "oyn3478qo47nqooina7o8oao7nroOL" 1000000 times, these are the results:

MinGW-64 port of gcc 5.2.0 on my machine

g++ main.cpp -Wall -Wextra -Wpedantic --std=c++14 -o3

all_of (with locale information): 652ms for 1000000 iterations
all_of: 63ms for 1000000 iterations
find_if: 63ms for 1000000 iterations
loop: 70ms for 1000000 iterations
range-loop: 69ms for 1000000 iterations

and coliru with gcc 6.1.0:

g++ main.cpp -Wall -Wextra -Wpedantic --std=c++14 -o3

all_of (with locale information): 1404ms for 1000000 iterations
all_of: 101ms for 1000000 iterations
find_if: 110ms for 1000000 iterations
loop: 108ms for 1000000 iterations
range-loop: 119ms for 1000000 iterations

and clang 3.8.0 on coliru:

clang++ -std=c++14 -O3 -Wall -Wextra -Wpedantic main.cpp

all_of (with locale information): 1127ms for 1000000 iterations
all_of: 85ms for 1000000 iterations
find_if: 72ms for 1000000 iterations
loop: 128ms for 1000000 iterations
range-loop: 88ms for 1000000 iterations

As you can see, it varies by compiler and version which function is the fastest. Isn't optimization fun?

this is the function I used to test each method:

using StrEvaluator = bool (*)(const std::string&);
using Milliseconds = std::chrono::milliseconds;

void testStrEvaluator(StrEvaluator eval, std::string str){
    auto begin = std::chrono::steady_clock::now();
    bool result = true;
    for(unsigned int i = 0; i < 1000000; ++i){
        str.resize(str.size());
        result &= eval(str);
    }
    auto end = std::chrono::steady_clock::now();
    std::cout
        << std::chrono::duration_cast<Milliseconds>(end - begin).count()
        << "ms for 1000000 iterations\n";
}

There are flaws with the testing: coliru doesn't provide guarantees regarding consistent resources during execution, and I didn't shut down other programs on my computer, so the variations could be a fluke. However, they seem consistent enough to draw a few conclusions from: both the looping constructs from algorithm and raw loops can perform quite well, and choosing between them based on speed (unless you've discovered the loop is a bottleneck) is more micro-optimization than anything else.

like image 25
jaggedSpire Avatar answered Jan 18 '23 22:01

jaggedSpire