 

How to import non-ASCII characters into console?

Tags:

c++

c++17

I've been scratching my head at this for a while and I'm in need of some assistance. Basically, I want the code to read a series of non-ASCII symbols into an empty pre-set array, and I'm printing them to check that they were read in, which they currently are not. Notepad displays them just fine, but for some reason C++ doesn't recognise them as valid characters. Advice that only involves code, rather than changing my computer's internal settings, is strongly preferred.

char displayCharacters[5] = {};

try {

    instream.open("characters.txt");
    instream >> displayCharacters;
    cout << "Here is the first symbol: " << displayCharacters[4];

} 

catch (const exception&) {

    cout << "Something went wrong with the file handling.";

}

And yes, I have set up the streams correctly; cout comes from including iostream, and I'm using namespace std. Here's what the file contains:

█
 
▀
▄
▓

Edit: The file is UTF-8 if you need to know.

Kitso asked Jan 25 '23

1 Answer

tl;dr:

You need to decode UTF-8 before you can index it. Read on for more details than I was expecting to write…


A C++ stream isn’t encoding-aware – it’s just a stream of bytes. For example, this code to dump an entire UTF-8 string works just fine:

#include <iostream>
#include <sstream>
#include <string>

int main() {
    // Simulate your `instream` using an `std::stringstream`
    std::stringstream instream;
    // Load the simulated `instream` using a UTF-8 string literal [1]
    instream << u8"█\n \n▀\n▄\n▓\n";
    
    // Print entire `instream`
    std::cout << instream.rdbuf();
}

[1]: https://en.cppreference.com/w/cpp/language/string_literal

Your problem comes from the UTF-8 encoding itself. UTF-8 is a multibyte encoding. Some characters (notably the ASCII characters) are encoded as a single byte. For instance, the letter a is encoded as the value 97 (0x61 in hex).

Let’s take a look at the five characters you’re trying to print:

Char      Unicode codepoint   UTF-8 encoding   Unicode name
█         U+2588              0xe2 0x96 0x88   FULL BLOCK
(space)   U+0020              0x20             SPACE (plain ASCII)
▀         U+2580              0xe2 0x96 0x80   UPPER HALF BLOCK
▄         U+2584              0xe2 0x96 0x84   LOWER HALF BLOCK
▓         U+2593              0xe2 0x96 0x93   DARK SHADE

The UTF-8 encoding is the interesting part here – that’s how each of these characters is stored as a sequence of bytes in a UTF-8 encoded file. For the four block-drawing characters (we’ll ignore the space because that’s just a single-byte character), the encoding takes three bytes.
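If you want to verify these encodings yourself, here's a quick sketch (mine, not from the question) that hex-dumps the bytes of a u8 string literal:

#include <cstdio>
#include <string>

int main() {
    // In C++17, a u8 literal is an ordinary char array holding the UTF-8 bytes
    const std::string s = u8"█ ▀▄▓";
    for (unsigned char byte : s)
        std::printf("0x%02x ", byte);
    std::printf("\n");
    // Prints: 0xe2 0x96 0x88 0x20 0xe2 0x96 0x80 0xe2 0x96 0x84 0xe2 0x96 0x93
}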

But why does the encoding take three bytes if the codepoint is only two bytes long?

Good question. Let’s break down the first character:

   0xe2     0x96     0x88
 11100010 10010110 10001000
 AAAA^^^^ BB^^^^^^ BB^^^^^^

The annotations underneath the binary indicate how the encoding works.

Since the codepoint for the character is too big to fit into a single byte, UTF-8 breaks it into multiple bytes. However, there must be a way to determine that a sequence of bytes represents a single character rather than a sequence of simpler characters. This is where the byte prefixes (A and B) come into play. The first byte in a multibyte sequence begins with a run of 1 bits giving the total number of bytes in the encoded character, followed by a terminating 0. Here we need three bytes, so we have 1110 (A).

The prefixes of the remaining two bytes indicate that they are continuation bytes (i.e. they should not be considered the beginning of a character). The prefix for continuation bytes is defined as 10 (B).

After removing these prefixes, the remaining bits (marked with carets [^]) are packed together and parsed to retrieve the encoded codepoint.

Single byte characters (i.e. the basic US-ASCII plane of characters from 0 to 127) only require 7 bits to encode, so a 0 bit is prefixed to indicate there are no continuation bytes for this character.
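To make the decoding concrete, here's a minimal sketch (my own illustration, not a library routine) that undoes the prefixes for the single-byte and three-byte cases described above:

#include <cstdint>
#include <cstdio>

// Decode one UTF-8 sequence starting at p. Only the single-byte and
// three-byte cases from above are handled; a real decoder also needs the
// two- and four-byte forms plus error handling.
std::uint32_t decodeOne(const unsigned char* p) {
    if ((p[0] & 0x80) == 0x00)                        // 0xxxxxxx: plain ASCII
        return p[0];
    if ((p[0] & 0xf0) == 0xe0)                        // 1110xxxx: 3-byte sequence
        return (std::uint32_t(p[0] & 0x0f) << 12)     // strip prefix A, keep 4 bits
             | (std::uint32_t(p[1] & 0x3f) << 6)      // strip prefix B, keep 6 bits
             | std::uint32_t(p[2] & 0x3f);            // strip prefix B, keep 6 bits
    return 0xfffd;                                    // other cases not handled here
}

int main() {
    const unsigned char fullBlock[] = { 0xe2, 0x96, 0x88 };
    std::printf("U+%04X\n", static_cast<unsigned>(decodeOne(fullBlock))); // U+2588
}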

What does all this have to do with your problem?

I said earlier that “your problem comes from the UTF-8 encoding itself”. Well, I lied. Sorry. Your problem comes from attempting to read UTF-8 encoded data as a plain sequence of bytes.

With the encoding table above, let’s take a look at the raw bytes in your file (assuming a single \n terminating each line):

e2 96 88 0a 20 0a e2 96 80 0a e2 96 84 0a e2 96 93 0a
\--01--/    02    \--03--/    \--04--/    \--05--/

I’ve marked the characters by their line numbers.
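You can reproduce this dump with a few lines of code; here's a sketch (assuming the characters.txt file name from the question) that reads the file back as raw, uninterpreted bytes:

#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

int main() {
    // Open in binary mode so the bytes arrive completely untranslated
    std::ifstream in("characters.txt", std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());
    for (unsigned char b : bytes)
        std::printf("%02x ", b);
    std::printf("\n");
}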

From this dump, you can easily see what the output of your problematic code will be:

char displayCharacters[5] = {};
std::cout << "Here is the first symbol: " << displayCharacters[4];

It's a space! Remember, the stream isn't aware of the file's encoding, so it just spits out a sequence of bytes (a char in C/C++ is just an 8-bit variable). Your array (displayCharacters) contains the sequence of bytes shown above, so subscripting it at index 4 returns the fifth byte: 0x20.

You actually got lucky here. Indexing UTF-8 data as raw bytes often causes much uglier errors. Remember those continuation bytes (beginning 10)? If you extract and try to print one of those on its own, your terminal will have no idea what to do with it. Similarly with the beginning of a multibyte sequence (prefix 11).
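You can see this for yourself by writing a lone continuation byte straight to the terminal, e.g.:

#include <iostream>

int main() {
    // 0x96 is a continuation byte from FULL BLOCK; on its own it's invalid
    // UTF-8, so most terminals render it as U+FFFD (�) or nothing at all
    std::cout << static_cast<char>(0x96) << '\n';
}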

Properly indexing UTF-8 strings is hard. You’ll almost certainly want a library to handle it.

Depending on the use and/or origin of the file in question, you might want to consider a fixed-width encoding such as UTF-32.
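For illustration, here's a rough sketch of such a conversion (a hypothetical toUtf32 helper, assuming well-formed input; real code should use a vetted Unicode library):

#include <cstddef>
#include <string>

// Very rough UTF-8 -> UTF-32 conversion; assumes the input is well formed
std::u32string toUtf32(const std::string& utf8) {
    std::u32string out;
    for (std::size_t i = 0; i < utf8.size();) {
        unsigned char lead = utf8[i];
        std::size_t len;
        char32_t cp;
        if      (lead < 0x80)           { cp = lead;        len = 1; } // 0xxxxxxx
        else if ((lead & 0xe0) == 0xc0) { cp = lead & 0x1f; len = 2; } // 110xxxxx
        else if ((lead & 0xf0) == 0xe0) { cp = lead & 0x0f; len = 3; } // 1110xxxx
        else                            { cp = lead & 0x07; len = 4; } // 11110xxx
        for (std::size_t j = 1; j < len; ++j)  // fold in the continuation bytes
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + j]) & 0x3f);
        out.push_back(cp);
        i += len;
    }
    return out;
}

int main() {
    // C++17: a u8 literal is a plain char array, so it converts to std::string
    std::u32string chars = toUtf32(u8"█ ▀▄▓");
    // Now indexing works per character: chars[4] is U+2593 DARK SHADE
    return chars[4] == U'\u2593' ? 0 : 1;
}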

MTCoster answered Jan 29 '23