 

What is the idea behind ^= 32, that converts lowercase letters to upper and vice versa?



Let's take a look at the ASCII code table in binary.

A 1000001    a 1100001
B 1000010    b 1100010
C 1000011    c 1100011
...
Z 1011010    z 1111010

And 32 is 0100000, which is exactly the one bit in which the lowercase and uppercase letters differ. So toggling that bit toggles the case of a letter.
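
As a quick illustration (a minimal sketch, assuming an ASCII execution character set), toggling that bit in C++ looks like this:

#include <iostream>

int main()
{
    char c = 'G';
    c ^= 32;                  // flips the bit with value 32: 'G' (1000111) -> 'g' (1100111)
    std::cout << c << '\n';   // prints: g
    c ^= 32;                  // XOR with the same mask again restores the original
    std::cout << c << '\n';   // prints: G
}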


This uses the fact that ASCII values have been chosen by really smart people.

foo ^= 32;

This flips the 6th lowest bit1 of foo (sort of the ASCII lowercase flag), transforming an ASCII uppercase letter to lowercase and vice versa.

+---+------------+------------+
|   | Upper case | Lower case |  32 is 00100000
+---+------------+------------+
| A | 01000001   | 01100001   |
| B | 01000010   | 01100010   |
|            ...              |
| Z | 01011010   | 01111010   |
+---+------------+------------+

Example

'A' ^ 32

    01000001 'A'
XOR 00100000 32
------------
    01100001 'a'

And by property of XOR, 'a' ^ 32 == 'A'.
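
Since XOR with a fixed mask is its own inverse, flipping the bit twice gets back the original character. A small self-check (again assuming ASCII, where 'A'..'Z' and 'a'..'z' are contiguous) that this round-trips for every letter:

#include <cassert>

int main()
{
    for (char upper = 'A'; upper <= 'Z'; ++upper) {
        char lower = upper ^ 32;        // 'A' -> 'a', ..., 'Z' -> 'z'
        assert(lower >= 'a' && lower <= 'z');
        assert((lower ^ 32) == upper);  // toggling again goes back
    }
}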

Notice

C++ is not required to use ASCII to represent characters; another encoding is EBCDIC. This trick only works on ASCII platforms. A more portable solution is to use std::tolower and std::toupper, with the added bonus of being locale-aware (it does not automagically solve all your problems, though; see the comments):

#include <cassert>
#include <locale>

bool case_insensitive_equal(char lhs, char rhs)
{
    // passing std::locale{} is optional; it enables locale-awareness
    return std::tolower(lhs, std::locale{}) == std::tolower(rhs, std::locale{});
}

assert(case_insensitive_equal('A', 'a'));

1) As 32 is 1 << 5 (2 to the power 5), it flips the 6th bit (counting from 1).
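
To compare whole strings rather than single characters, the same predicate can be plugged into std::equal. A sketch (the string overload here is my own addition, not part of the original answer):

#include <algorithm>
#include <cassert>
#include <locale>
#include <string>

bool case_insensitive_equal(const std::string& lhs, const std::string& rhs)
{
    // equal length and pairwise equal after lowercasing each character
    return lhs.size() == rhs.size()
        && std::equal(lhs.begin(), lhs.end(), rhs.begin(),
                      [](char a, char b) {
                          return std::tolower(a, std::locale{}) ==
                                 std::tolower(b, std::locale{});
                      });
}

int main()
{
    assert(case_insensitive_equal("Hello", "hELLo"));
    assert(!case_insensitive_equal("Hello", "World"));
}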


Allow me to say that this is -- although it seems smart -- a really, really stupid hack. If someone recommends this to you in 2019, hit him. Hit him as hard as you can.
You can, of course, do it in your own software that you and nobody else uses if you know that you will never use any language but English anyway. Otherwise, no go.

The hack was arguably "OK" some 30-35 years ago, when computers didn't really do much but English in ASCII, and maybe one or two major European languages. But... no longer so.

The hack works because the US-Latin uppercase and lowercase letters are exactly 0x20 apart and appear in the same order, so they differ in just one bit, which is exactly the bit this hack toggles.

Now, the people creating code pages for Western Europe, and later the Unicode consortium, were smart enough to keep this scheme for e.g. German umlauts and French accented vowels. Not so for ß, which (until someone convinced the Unicode consortium in 2017, and a large Fake News print magazine wrote about it, actually convincing the Duden -- no comment on that) didn't even exist as a capital (it transforms to SS). Now it does exist as a capital, but the two are 0x1DBF positions apart, not 0x20.

The implementors were, however, not considerate enough to keep this going everywhere. If you apply the hack to some East European languages, for example (I wouldn't know about Cyrillic), you will get a nasty surprise. All those háček ("hatchet") characters are examples of that: lowercase and uppercase are one code point apart, not 0x20. The hack thus does not work properly there.
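
To make that concrete, here is a small sketch in terms of Unicode code points, using the pair Č (U+010C) and č (U+010D) as an example:

#include <iostream>

int main()
{
    char32_t upper = U'\u010C';   // Č
    char32_t lower = U'\u010D';   // č
    std::cout << std::hex
              << static_cast<unsigned>(lower - upper) << '\n'   // 1   -- adjacent, not 0x20 apart
              << static_cast<unsigned>(upper ^ 0x20)  << '\n';  // 12c -- U+012C is Ĭ, a different letter
}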

There's much more to consider, for example, some characters do not simply transform from lower- to uppercase at all (they're replaced with different sequences), or they may change form (requiring different code points).

Do not even think about what this hack will do to stuff like Thai or Chinese (it'll just give you complete nonsense).

Saving a couple of hundred CPU cycles may have been very worthwhile 30 years ago, but nowadays, there is really no excuse for not converting a string properly. There are library functions for performing this non-trivial task.
The time taken to convert several dozen kilobytes of text properly is negligible nowadays.
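
For single-byte text, the standard library already has the building blocks; a minimal sketch (note that genuinely correct Unicode case mapping, including multi-character results such as ß -> SS, still needs a dedicated library such as ICU):

#include <algorithm>
#include <cctype>
#include <iostream>
#include <string>

int main()
{
    std::string s = "Mixed CASE Text";
    // Convert byte by byte; the unsigned char cast avoids undefined behaviour
    // for negative char values. This is only correct for single-byte encodings.
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    std::cout << s << '\n';   // prints: mixed case text
}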


It works because, as it happens, the difference between 'a' and 'A' in ASCII and derived encodings is 32, and 32 is also the value of the sixth bit. Flipping the sixth bit with an exclusive OR thus converts between upper and lower case.