How to use UTF-8 in C code?

My setup: gcc-4.9.2, UTF-8 environment.

The following C program works with ASCII input, but not with UTF-8.

Create input file:

    echo -n 'привет мир' > /tmp/вход

This is test.c:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define SIZE 10

    int main(void) {
      char buf[SIZE+1];
      char *pat = "привет мир";
      char str[SIZE+2];

      FILE *f1;
      FILE *f2;

      f1 = fopen("/tmp/вход","r");
      f2 = fopen("/tmp/выход","w");

      if (fread(buf, 1, SIZE, f1) > 0) {
        buf[SIZE] = 0;

        if (strncmp(buf, pat, SIZE) == 0) {
          sprintf(str, "% 11s\n", buf);
          fwrite(str, 1, SIZE+2, f2);
        }
      }

      fclose(f1);
      fclose(f2);

      exit(0);
    }

Check the result:

    ./test; grep -q ' привет мир' /tmp/выход && echo OK

What should be done to make UTF-8 code work as if it were ASCII code, without having to care how many bytes a symbol takes, etc.? In other words: what should be changed in the example to treat any UTF-8 symbol as a single unit (that includes argv, STDIN, STDOUT, STDERR, file input, output and the program code)?

asked May 22 '15 03:05

Igor Liferenko

2 Answers

    #define SIZE 10

The buffer size of 10 is too small for the UTF-8 string привет мир: each of its nine Cyrillic letters takes 2 bytes and the space takes 1, for 19 bytes in total. Try changing it to a larger value. On my system (Ubuntu 12.04, gcc 4.8.1), changing it to 20 worked perfectly.

UTF-8 is a multibyte encoding that uses between 1 and 4 bytes per character, so for a 10-character string it is safer to use 40 as the buffer size above. There is a big discussion at "How many bytes does one Unicode character take?" which might be interesting.
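To make the byte counts above concrete, here is a small sketch. The names `utf8_worst_case_size` and `privet_mir` are made up for this illustration; the string is written as explicit octal escapes so that the result does not depend on how the source file is encoded.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: worst-case buffer size for n_chars UTF-8
   characters (4 bytes each) plus the terminating NUL. */
size_t utf8_worst_case_size(size_t n_chars)
{
    return 4 * n_chars + 1;
}

/* "привет мир" as explicit UTF-8 bytes: 9 Cyrillic letters at
   2 bytes each plus 1 ASCII space. */
const char privet_mir[] =
    "\320\277\321\200\320\270\320\262\320\265\321\202"
    " \320\274\320\270\321\200";
```

With these definitions, `strlen(privet_mir)` is 19 and `utf8_worst_case_size(10)` is 41, which is why a buffer of 20 bytes happens to be enough for this particular string while 40 bytes of payload covers any 10-character input.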

answered Oct 01 '22 05:10

Siddhartha Ghosh

This is more of a corollary to the other answers, but I'll try to explain this from a slightly different angle.

Here is Jonathan Leffler's version of your code, with three slight changes: (1) I made explicit the actual individual bytes in the UTF-8 strings; (2) I modified the sprintf formatting string width specifier to hopefully do what you are actually attempting to do; and, tangentially, (3) I used perror to get a slightly more useful error message when something fails.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define SIZE 40

    int main(void) {
      char buf[SIZE + 1];
      char *pat = "\320\277\321\200\320\270\320\262\320\265\321\202"
        " \320\274\320\270\321\200";  /* "привет мир" */
      char str[SIZE + 3];  /* one pad byte + up to SIZE bytes + '\n' + NUL */

      FILE *f1 = fopen("\320\262\321\205\320\276\320\264", "r");  /* "вход" */
      FILE *f2 = fopen("\320\262\321\213\321\205\320\276\320\264", "w");  /* "выход" */

      if (f1 == 0 || f2 == 0)
        {
          perror("Failed to open one or both files");  /* use perror() */
          return(1);
        }

      size_t nbytes;
      if ((nbytes = fread(buf, 1, SIZE, f1)) > 0)
        {
          buf[nbytes] = 0;

          if (strncmp(buf, pat, nbytes) == 0)
            {
              sprintf(str, "%*s\n", 1+(int)nbytes, buf);  /* nbytes+1 length specifier */
              fwrite(str, 1, 2+nbytes, f2);  /* pad byte + nbytes + the newline */
            }
        }

      fclose(f1);
      fclose(f2);

      return(0);
    }

The behavior of sprintf with a positive numeric width specifier is to pad with spaces from the left, so the space you tried to use is superfluous. But you have to make sure the target field is wider than the string you are printing in order for any padding to actually take place.
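A minimal sketch of why the original width of 11 produced no padding: for `%s`, the field width is measured in bytes, not characters. The wrapper name `pad_left_bytes` is invented for this example; it only forwards to the standard `snprintf`.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical wrapper: right-align s in a field of `width` BYTES.
   Returns whatever snprintf returns, i.e. the number of bytes the
   full result needs, not counting the terminating NUL. */
int pad_left_bytes(char *out, size_t outsz, int width, const char *s)
{
    return snprintf(out, outsz, "%*s", width, s);
}
```

Called with the 19-byte string "привет мир" and a width of 11, this copies the string unchanged, because 19 already exceeds 11. A width of 20 (that is, `nbytes + 1`) yields exactly one leading space.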

Just to make this answer self-contained, I will repeat what others have already said. A traditional char is always exactly one byte, but one character in UTF-8 is usually not exactly one byte, except when all your characters are actually ASCII. One of the attractions of UTF-8 is that legacy C code doesn't need to know anything about UTF-8 in order to continue to work, but of course, the assumption that one char is one glyph cannot hold. (As you can see, for example, the glyph п in "привет мир" maps to the two bytes -- and hence, two chars -- "\320\277".)
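If you only need to know how many characters a byte string holds, one common trick is to count every byte that is not a UTF-8 continuation byte. The function name `utf8_count_chars` is made up here, and the sketch assumes the input is already valid UTF-8; it counts code points, so combining sequences would still count as several units.

```c
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated string
   that is assumed to be valid UTF-8. */
size_t utf8_count_chars(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        /* Continuation bytes have the bit pattern 10xxxxxx; any other
           byte starts a new character. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For "привет мир" this returns 10 even though `strlen` reports 19, and for a pure ASCII string the two numbers agree.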

This is clearly less than ideal, but demonstrates that you can treat UTF-8 as "just bytes" if your code doesn't particularly care about glyph semantics. If yours does, you are better off switching to wchar_t as outlined e.g. here: http://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html

However, the standard wchar_t is less than ideal when the standard expectation is UTF-8. See e.g. the GNU libunistring documentation for a less intrusive alternative, and a bit of background. With that, you should be able to replace char with uint8_t and the various str* functions with u8_str* replacements and be done. The assumption that one glyph equals one byte will still need to be addressed, but that becomes a minor technicality in your example program. An adaptation is available at http://ideone.com/p0VfXq (though unfortunately the library is not available on http://ideone.com/ so it cannot be demonstrated there).
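As a rough illustration of the standard wide-character route mentioned above (not libunistring), the following sketch decodes a byte string one character at a time with `mbrtowc`. The function names are invented for this example, and the result depends on the process locale: it only decodes UTF-8 if a UTF-8 locale has been selected first, which is an assumption about your environment.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: adopt the locale from the environment (LANG,
   LC_ALL, ...). Returns nonzero on success. */
int use_environment_locale(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}

/* Hypothetical helper: count characters by decoding s with the
   current locale's encoding. Returns (size_t)-1 if the bytes are
   not valid in that encoding. */
size_t count_chars_locale(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);
    size_t count = 0;

    memset(&state, 0, sizeof state);  /* initial conversion state */
    while (remaining > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, remaining, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;        /* invalid or truncated sequence */
        if (n == 0)
            break;                    /* decoded a NUL character */
        s += n;
        remaining -= n;
        count++;
    }
    return count;
}
```

In a UTF-8 locale, `count_chars_locale` should report 10 for "привет мир". In the default "C" locale the non-ASCII bytes may be rejected, which is why `use_environment_locale()` (or an explicit `setlocale` call) has to come first.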

answered Oct 01 '22 04:10

tripleee
