It turns out that uppercasing a character is a complicated business. If you get out of the basic ASCII character set, the rules for uppercasing a character and lowercasing a character are actually dependent on the locale in which the application is running.
As a demonstration, I am attempting to uppercase the letter 'i' (with a dot) and the letter 'ı' (without a dot). In en_US, 'i' (with a dot) uppercases to 'I', and 'ı' (without a dot) is not part of the alphabet (but still uppercases to 'I').
But, if I switch to Turkish (tr_TR.UTF-8), 'i' (with a dot) must uppercase to 'İ' (also with a dot) and 'ı' (without a dot) must uppercase to 'I' (also without a dot). Lowercase should reverse these operations.
iİıI --> İİII (tr_TR.UTF-8)
iİıI --> IİII (en_US.UTF-8)
Now, I can do this perfectly in C. How can I do it in Haskell? All of the searches that I do point me directly to Data.Char.toUpper, which is not locale-aware. I haven't found any functions that are locale-aware in any way.
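To make the limitation concrete, here is a minimal sketch using only base. The helper name upperNoLocale is made up for this example. Data.Char.toUpper takes no locale argument, so it maps each character the same way regardless of the locale the program runs under:

```haskell
import Data.Char (toUpper)

-- Hypothetical helper: uppercase every character with Data.Char.toUpper,
-- which has no way to be told about the locale.
upperNoLocale :: String -> String
upperNoLocale = map toUpper

main :: IO ()
main =
  -- Print code points instead of the characters themselves so the result
  -- does not depend on the terminal encoding.
  -- 'i' and 'ı' both become 'I' (73); 'İ' (304) is left unchanged.
  print (map fromEnum (upperNoLocale "iİıI"))
  -- prints [73,304,73,73]
```

This is the en_US result (IİII); there is no way to ask this function for the Turkish one.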
Here's a code sample from C. I run it on my Linux machine.
    #include <stdio.h>
    #include <stdlib.h>
    #include <locale.h>
    #include <wchar.h>
    #include <wctype.h>

    /* Each array is a NUL-terminated wide string of Unicode code points. */
    wchar_t latin_small_sharp_s[5] = {0x00df, 0x00df, 0x0053, 0x0053, 0};
    wchar_t turkish_is[5] = {0x0069, 0x0130, 0x0131, 0x0049, 0};
    /* The same Turkish letters encoded as UTF-8 (unused below). */
    unsigned char multibyte_turkish_is[7] = {0x69, 0xc4, 0xb0, 0xc4, 0xb1, 0x49, 0};

    /* Uppercase str (len elements, including the terminator) under the given locale. */
    void print_in_locale (const char *locale, const wchar_t *str, const size_t len) {
        wchar_t *dest;
        size_t i;

        /* setlocale is not required to set errno, so report only the locale name. */
        if (!setlocale(LC_CTYPE, locale)) {
            fprintf(stderr, "Locale %s is not available\n", locale);
            exit(1);
        }
        dest = calloc(len * 2, sizeof(wchar_t));
        if (!dest) {
            fprintf(stderr, "Out of memory\n");
            exit(1);
        }
        for (i = 0; i < len; i++) {
            dest[i] = towupper(str[i]);
        }
        printf("%ls, %ls\n", str, dest);
        free(dest);
    }

    int main (void) {
        print_in_locale("de_DE.utf8", latin_small_sharp_s, 5);
        print_in_locale("tr_TR.utf8", turkish_is, 5);
        print_in_locale("de_DE.utf8", turkish_is, 5);
        return 0;
    }
If you save it as "locale_test.c", you can compile and run it from the command line with:
    gcc -o locale_test locale_test.c && ./locale_test
Use the Data.Text.ICU.toUpper function from the text-icu package.
toUpper :: LocaleName -> Text -> Text
Uppercase the characters in a string.
Casing is locale dependent and context sensitive. The result may be longer or shorter than the original.
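As a rough sketch of how this could be applied to the strings from the question, assuming the text and text-icu packages (and the ICU C libraries they bind to) are installed. The helper name upperIn is made up for this example, and the locale names are passed through the Locale constructor of LocaleName:

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Data.Text.ICU (LocaleName (Locale), toUpper)
import System.IO (hSetEncoding, stdout, utf8)

-- Hypothetical helper: uppercase a Text under the named ICU locale.
upperIn :: String -> T.Text -> T.Text
upperIn name = toUpper (Locale name)

main :: IO ()
main = do
  -- Force UTF-8 output so the dotted and dotless letters print correctly.
  hSetEncoding stdout utf8
  let sample = T.pack "iİıI"
  TIO.putStrLn (upperIn "tr_TR" sample)         -- İİII
  TIO.putStrLn (upperIn "en_US" sample)         -- IİII
  -- The result can be longer than the input: sharp s becomes two letters.
  TIO.putStrLn (upperIn "de_DE" (T.pack "ß"))   -- SS
```

Unlike the C version, which maps one wchar_t at a time with towupper, this works on the whole string, which is why a single ß can expand to SS.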