Accessing foreign-language string character by character

I know that this question might be very elementary. Please excuse me if this is something that is obvious. Consider the following program:

#include <stdio.h>

int main(void) {
   // this is a string in English
   char * str_1 = "This is a string.";
   // this is a string in Russian
   char * str_2 = "Это строковая константа.";
   // iterator
   int i;
   // print English string as a string
   printf("%s\n", str_1);
   // print English string byte by byte
   for(i = 0; str_1[i] != '\0'; i++) {
      printf(" %c  ",(char) str_1[i]);
   }
   printf("\n");
   // print numerical values of English string byte by byte
   for(i = 0; str_1[i] != '\0'; i++) {
      printf("%03d ",(int) str_1[i]);
   }
   printf("\n");
   // print Russian string as a string
   printf("%s\n", str_2);
   // print Russian string byte by byte
   for(i = 0; str_2[i] != '\0'; i++) {
      printf(" %c  ",(char) str_2[i]);
   }
   printf("\n");
   // print numerical values of Russian string byte by byte
   for(i = 0; str_2[i] != '\0'; i++) {
      printf("%03d ",(int) str_2[i]);
   }
   printf("\n");
   return(0);
}

Output:

This is a string.
 T   h   i   s       i   s       a       s   t   r   i   n   g   .
084 104 105 115 032 105 115 032 097 032 115 116 114 105 110 103 046
Это строковая константа.
 ▒   ▒   ▒   ▒   ▒   ▒       ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒       ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   ▒   .
-48 -83 -47 -126 -48 -66 032 -47 -127 -47 -126 -47 -128 -48 -66 -48 -70 -48 -66 -48 -78 -48 -80 -47 -113 032 -48 -70 -48 -66 -48 -67 -47 -127 -47 -126 -48 -80 -48 -67 -47 -126 -48 -80 046

It can be seen that an English (ASCII) string can be printed as a string or accessed using array indexes and printed character by character (byte by byte), but a Russian string (I believe encoded as UTF-8) can be printed as a string but not accessed character by character.

I understand that this happens because each Russian character here is encoded using two bytes instead of one.

What I am wondering is whether there is any easy way to print a Unicode string character by character (in this case two bytes by two bytes) using standard C library functions by proper declaration of a data type or by labeling the string somehow or by setting a locale or in some other way.

I tried prefixing the Russian string with u8, that is char * str_2 = u8"...", but this doesn't change the behavior. I'd like to stay away from wide characters, which make assumptions about the language being used, for example exactly two bytes per character. Any advice would be appreciated.

Thomas Hedden asked Jan 09 '17 03:01



1 Answer

I think the mblen(), mbtowc(), wctomb(), mbstowcs() and wcstombs() functions from <stdlib.h> are partially relevant. You can find out how many bytes make up each character in the string with mblen(), for example.
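As a minimal sketch of the mblen() idea, the helper below counts characters rather than bytes. The name mb_char_count is made up for this illustration, and the result depends on whichever LC_CTYPE locale is active when it is called, so treat it as an outline rather than a finished routine.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count the characters (not bytes) in a multibyte
 * string under the current LC_CTYPE locale.
 * Returns -1 if an invalid or incomplete sequence is found. */
long mb_char_count(const char *s)
{
    assert(s != NULL);
    long count = 0;
    size_t left = strlen(s);

    (void)mblen(NULL, 0);           /* reset any shift state */
    while (left > 0) {
        int n = mblen(s, left);     /* bytes in the next character */
        if (n <= 0)
            return -1;              /* not valid in this locale */
        s += n;
        left -= (size_t)n;
        count++;
    }
    return count;
}
```

For a plain ASCII string the count equals strlen(). Once a UTF-8 locale is active, a call such as mb_char_count("Это") should report 3 characters for the 6 bytes.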

Another seldom-used header and function that are relevant here are <locale.h> and setlocale().

Here's an adaptation of your code:

#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

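// Print each byte of a null-terminated byte string (NTBS) as two hex digits
static inline void ntbs_hex_dump(const char *pc_ntbs)
{
    // View the bytes as unsigned so values above 0x7F print as 80..FF
    const unsigned char *ntbs = (const unsigned char *)pc_ntbs;
    for (int i = 0; ntbs[i] != '\0'; i++)
        printf(" %.2X ", ntbs[i]);
    putchar('\n');
}

// Print each byte of a null-terminated byte string as a single character
static inline void ntbs_chr_dump(const char *pc_ntbs)
{
    const unsigned char *ntbs = (const unsigned char *)pc_ntbs;
    for (int i = 0; ntbs[i] != '\0'; i++)
        printf(" %c  ", ntbs[i]);
    putchar('\n');
}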

int main(void)
{
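    // Adopt the locale from the environment; setlocale() returns NULL on failure
    char *loc = setlocale(LC_ALL, "");
    printf("Locale: %s\n", loc != NULL ? loc : "(setlocale failed)");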

    char *str_1 = "This is a string.";
    char *str_2 = "Это строковая константа.";

    printf("English:\n");
    printf("%s\n", str_1);
    ntbs_chr_dump(str_1);
    ntbs_hex_dump(str_1);

    printf("Russian:\n");
    printf("%s\n", str_2);
    ntbs_chr_dump(str_2);
    ntbs_hex_dump(str_2);

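    // Walk the Russian string one multibyte character at a time
    char *mbp = str_2;
    while (*mbp != '\0')
    {
        enum { MBS_LEN = 10 };
        // Number of bytes in the character at mbp; -1 if invalid in this locale
        int mbl = mblen(mbp, strlen(mbp));
        char mbs[MBS_LEN];
        assert(mbl < MBS_LEN - 1 && mbl > 0);
        // Copy the character's bytes into a small string of its own and print it
        memmove(mbs, mbp, (size_t)mbl);
        mbs[mbl] = '\0';
        printf(" %s ", mbs);
        mbp += mbl;
    }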
    putchar('\n');

    return(0);
}

The setlocale() call is important, at least on macOS Sierra 10.12.2 (with GCC 6.3.0), which is where I developed and tested it. Without it, mblen() always returns 1, so the final loop just prints single bytes again and gains nothing.
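One way to see the effect of the locale is to ask how many bytes a single character may occupy, which is what MB_CUR_MAX reports. The sketch below assumes nothing beyond <locale.h> and <stdlib.h>; the function name max_bytes_per_char is invented for the illustration.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: switch LC_CTYPE to the named locale ("" means: take
 * it from the environment) and report the longest multibyte character, in
 * bytes, that the locale allows.
 * Returns -1 if the locale is not available on this system. */
int max_bytes_per_char(const char *locale_name)
{
    assert(locale_name != NULL);
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    return (int)MB_CUR_MAX;
}
```

In the "C" locale that every program starts in, this returns 1. With "" and a UTF-8 locale in the environment, it should return a value greater than 1, which is what lets mblen() recognise the two-byte Cyrillic characters.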

The output I get from that is:

Locale: en_US.UTF-8
English:
This is a string.
 T   h   i   s       i   s       a       s   t   r   i   n   g   .  
 54  68  69  73  20  69  73  20  61  20  73  74  72  69  6E  67  2E 
Russian:
Это строковая константа.
 ?   ?   ?   ?   ?   ?       ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?       ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   .  
 D0  AD  D1  82  D0  BE  20  D1  81  D1  82  D1  80  D0  BE  D0  BA  D0  BE  D0  B2  D0  B0  D1  8F  20  D0  BA  D0  BE  D0  BD  D1  81  D1  82  D0  B0  D0  BD  D1  82  D0  B0  2E 
 Э  т  о     с  т  р  о  к  о  в  а  я     к  о  н  с  т  а  н  т  а  . 

With a bit more effort, the code could print the two bytes of each UTF-8 character closer together in the hex dump. The D0 and D1 leading bytes are correct for the UTF-8 encoding of the Cyrillic code block U+0400 .. U+04FF in the BMP (basic multilingual plane).
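A rough sketch of that grouping follows. It does not use the locale at all: it assumes the input is UTF-8 and reads the sequence length from the lead byte. Both function names are hypothetical, and the check is deliberately light (it does not reject overlong forms or code points outside the Unicode range).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Length in bytes of a UTF-8 sequence, judged from its lead byte alone.
 * Returns 0 for a byte that cannot start a sequence, such as a
 * continuation byte. */
int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx: e.g. D0, D1 */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Write the bytes of s into out as hex, one space-separated group per
 * character. Returns 0 on success, -1 on malformed input or if out is
 * too small. */
int utf8_hex_groups(const char *s, char *out, size_t outsz)
{
    assert(s != NULL && out != NULL);
    const unsigned char *p = (const unsigned char *)s;
    size_t used = 0;

    if (outsz == 0)
        return -1;
    out[0] = '\0';
    while (*p != '\0') {
        int len = utf8_seq_len(*p);
        if (len == 0)
            return -1;
        /* Room for a separator, two hex digits per byte, and the terminator */
        if (used + 2 * (size_t)len + 2 > outsz)
            return -1;
        if (used > 0)
            out[used++] = ' ';
        for (int i = 0; i < len; i++) {
            /* Bytes after the lead must look like 10xxxxxx; this also
             * catches a string that ends in the middle of a character */
            if (i > 0 && (p[i] & 0xC0) != 0x80)
                return -1;
            snprintf(out + used, outsz - used, "%02X", (unsigned)p[i]);
            used += 2;
        }
        p += len;
    }
    return 0;
}
```

Given the bytes D0 AD D1 82 D0 BE for "Это", utf8_hex_groups() fills the buffer with "D0AD D182 D0BE", so each Cyrillic letter appears as one four-digit group while ASCII characters stay as two digits.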

Just for amusement: the BSD sed refused to process the output because those question marks represent invalid codes: sed: RE error: illegal byte sequence.

Jonathan Leffler answered Sep 19 '22 07:09
