Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sorting a string with std::sort so that capital letters come after lower case

I'd like to sort a vector so that the capital letters follow the lower case letter. If I have something like

This is a test
this is a test
Cats
cats
this thing

I would like the output to be

cats
Cats
this is a test
This is a test
this thing

The standard library sort will output

Cats
This is a test
cats
this is a test
this thing

I want to pass a predicate to std::sort so that it compares the lowercase version of the strings that I pass as arguments.

bool compare(std::string x, std::string y)
{
    return lowercase(x) < lowercase(y);
}

I tried lowering each character within the function and then making the comparison but it didn't work. I would like to test this approach by converting the string to lowercase by some other method. How do I convert strings into lowercase?

EDIT::

Actually I figured out the problem. This works. When I first wrote the function, instead of ref = tolower(ref) I had tolower(ref) without reassigning to ref so it wasn't doing anything.

bool compare(std::string x, std::string y)
{
    for(auto &ref:x)
        ref = tolower(ref);
    for(auto &ref:y)
        ref = tolower(ref);
    return x < y;
}

EDIT::

This code actually sorts with the capital letter first sometimes and the capital letter second in other times so it doesn't solve the problem completely.

like image 846
Mars Avatar asked Oct 22 '13 04:10

Mars


1 Answers

The usual way to do this would be to build a collation table. That's just a table giving the relative ordering of every character. In your case, you want each upper-case letter immediately following the corresponding lower-case letter.

We can do that something like this:

class comp_char { 
    std::vector<int> collation_table;
public:
    comp_char() : collation_table(std::numeric_limits<unsigned char>::max()) {
        std::iota(collation_table.begin(), collation_table.end(), 0);

        for (int i = 0; i < 26; i++) {
            collation_table['a' + i] = i * 2;
            collation_table['A' + i] = i * 2 + 1;
        }
    }

    bool operator()(unsigned char a, unsigned char b) {
        return collation_table[a] < collation_table[b];
    }
};

For the moment, I've ignored the (possibly knotty) problem of the relative ordering of letters to other characters. As it's written, everything else sorts before letters, but it would be pretty easy to change that so (for example) letters sorted before anything else instead. It probably doesn't make a huge difference either way though -- most people don't have strong expectations about whether 'a' < ';' or not.

In any case, once the collation table is built and usable, you want to use it to compare strings:

struct cmp_str {
    bool operator()(std::string const &a, std::string const &b) {
        comp_char cmp;
        size_t i = 0;
        while (a[i] == b[i] && i < a.size())
            ++i;
        return cmp(a[i], b[i]);
    }
};

...which we can use to do sorting, something like this:

int main(){
    std::vector<std::string> inputs {
        "This is a test",
        "this is a test",
        "Cats",
        "cats",
        "this thing"
    };

    std::sort(inputs.begin(), inputs.end(), cmp_str());
    std::copy(inputs.begin(), inputs.end(),
        std::ostream_iterator<std::string>(std::cout, "\n"));
}

For the moment, I've only written the collation table to handle the basic US-ASCII letters. For real use, you'd typically want to have things like letters with accents and such sort next to their corresponding un-accented equivalents. For that, you typically end up pre-building the table to (partially) match things like the Unicode specification for how things should be ordered.

Note that this output doesn't quite match what the original question says is desired, but I think in this case the question has a mistake. I can't see any way it would be even marginally reasonable to produce an order like:

this is a test
This is a test
this thing

This has "T" sorting both after and before "t", which doesn't seem to make sense (or at least doesn't fit with a lexical sort, which is what people nearly always want for strings).

like image 75
Jerry Coffin Avatar answered Sep 19 '22 17:09

Jerry Coffin