Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sorting std::strings with numbers in them?

I'm currently sorting by the std::string < operator. The problem with it is that:

30 < 9. The 30 shows up before the 9 since 3 < 9, Windows 9x had this issue to. How could I go about sorting them numerically so that "30 Foxes" shows up after "9 dogs". I should also add that I'm using utf 8 encoding.

Thanks

like image 900
jmasterx Avatar asked Jan 07 '11 04:01

jmasterx


3 Answers

You can create a custom comparison function to use with std::sort. This function would have to check if the string begins with a numeric value. If it does, convert the numeric part of each string to an int using some mechanism like a stringstream. Then compare the two integer values. If the values compare equally, compare the non-numeric part of the strings lexicographically. Otherwise, if the strings don't contain a numeric part, simply compare the two strings lexicographically as normal.

Basically, something like the following (untested) comparison function:

bool is_not_digit(char c)
{
    return !std::isdigit(c);
}

bool numeric_string_compare(const std::string& s1, const std::string& s2)
{
    // handle empty strings...

    std::string::const_iterator it1 = s1.begin(), it2 = s2.begin();

    if (std::isdigit(s1[0]) && std::isdigit(s2[0])) {
        int n1, n2;
        std::stringstream ss(s1);
        ss >> n1;
        ss.clear();
        ss.str(s2);
        ss >> n2;

        if (n1 != n2) return n1 < n2;

        it1 = std::find_if(s1.begin(), s1.end(), is_not_digit);
        it2 = std::find_if(s2.begin(), s2.end(), is_not_digit);
    }

    return std::lexicographical_compare(it1, s1.end(), it2, s2.end());
}

And then...

std::sort(string_array.begin(), string_array.end(), numeric_string_compare);

EDIT: Of course, this algorithm is only useful if you're sorting strings where the numeric portion appears at the beginning of the string. If you're dealing with strings where the numeric portion can appear anywhere in the string, then you need a more sophisticated algorithm. See http://www.davekoelle.com/alphanum.html for more information.

like image 76
Charles Salvia Avatar answered Oct 12 '22 13:10

Charles Salvia


If you are targeting Windows (XP+) and can afford to convert your strings to utf-16, you can use the StrCmpLogicalW function from Shlwapi. See msdn for details.

Otherwise, ICU provides this functionality in its collators. See UCOL_NUMERIC_COLLATION.

like image 23
kbjorklu Avatar answered Oct 12 '22 13:10

kbjorklu


Here's a version that doesn't convert to integer and thus works for long strings of digits regardless of sizeof(int).

#include <cctype>
#include <cstddef>
#include <cstring>
#include <string>

int numcmp(const char *a, const char *aend, const char *b, const char *bend)
{
  for (;;) {
    if (a == aend) {
      if (b == bend)
        return 0;
      return -1;
    }
    if (b == bend)
      return 1;
    if (*a == *b) {
      ++a, ++b;
      continue;
    }
    if (!isdigit((unsigned char) *a) || !isdigit((unsigned char) *b))
      return *a - *b;

    // skip leading zeros in both strings
    while (*a == '0' && ++a != aend)
      ;
    while (*b == '0' && ++b != aend)
      ;

    // skip to end of the consecutive digits
    const char *aa = a;
    while (a != aend && isdigit((unsigned char) *a))
      ++a;
    std::ptrdiff_t alen = a - aa;

    const char *bb = b;
    while (b != bend && isdigit((unsigned char) *b))
      ++b;
    std::ptrdiff_t blen = b - bb;

    if (alen != blen)
      return alen - blen;

    // same number of consecutive digits in both strings
    while (aa != a) {
      if (*aa != *bb)
        return *aa - *bb;
      ++aa, ++bb;
    }
  }
}

int numcmp(const std::string& a, const std::string& b)
{
  return numcmp(a.data(), a.data() + a.size(),
                b.data(), b.data() + b.size());
}

int numcmp(const char *a, const char *b)
{
  return numcmp(a, a + strlen(a), b, b + strlen(b));
}
like image 42
Waxrat Avatar answered Oct 12 '22 12:10

Waxrat