
Reading UTF-8 characters from console

I'm trying to read UTF-8 encoded Polish characters from the console in my C++ application. I'm sure the console uses this code page (I checked in its properties). What I have already tried:

  • Using cin - instead of "zażółć" I read "za\0\0\0\0"
  • Using wcin - same result as with cin
  • Using scanf - instead of 'zażółć\0' I read 'za\0\0\0\0\0'
  • Using wscanf - same result as with scanf
  • Using getchar to read characters one by one - same result as with scanf

At the beginning of the main function I have the following lines:

setlocale(LC_ALL, "PL_pl.UTF-8");
SetConsoleOutputCP(CP_UTF8);
SetConsoleCP(CP_UTF8);

I would be really grateful for help.
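
For comparing results like the ones listed above, it can help to dump the exact bytes that were read. The following is only a minimal sketch; `hex_dump` is a hypothetical helper name, not something from the code in question.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of the original question): renders each byte
// of `s` as two lowercase hex digits, separated by single spaces.
std::string hex_dump(const std::string& s)
{
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Cast through unsigned char so bytes >= 0x80 are not sign-extended.
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned char>(s[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

A correctly read "zażółć" in UTF-8 would dump as `7a 61 c5 bc c3 b3 c5 82 c4 87`, while the failing reads described above would show `00` bytes after `7a 61`.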

J. Łyskawa asked Jan 09 '18 20:01


2 Answers

Although you’ve already accepted an answer, here’s a more portable version, which sticks closer to the standard library. Unfortunately, this is one area where I’ve found that a lot of widely used implementations do not support things that are supposedly in the standard. For example, there is supposed to be a standard way to print multi-byte strings (which theoretically could be something unusual like Shift-JIS, but in practice are UTF-8 on every modern OS), but it does not actually work portably. Microsoft’s runtime library is especially poor in this regard, but I’ve also found bugs in libc++.

/* Boilerplate feature-test macros: */
#if _WIN32 || _WIN64
#  define _WIN32_WINNT  0x0A00 // _WIN32_WINNT_WIN10
#  define NTDDI_VERSION 0x0A000002 // NTDDI_WIN10_RS1
#  include <sdkddkver.h>
#else
#  define _XOPEN_SOURCE     700
#  define _POSIX_C_SOURCE   200809L
#endif

#include <iostream>
#include <locale>
#include <locale.h>
#include <stdlib.h>
#include <string>

#ifndef MS_STDLIB_BUGS // Allow overriding the autodetection.
/* The Microsoft C and C++ runtime libraries that ship with Visual Studio, as
 * of 2017, have a bug: none of stdio, iostreams, or wide iostreams can
 * handle Unicode input or output.  Windows needs some non-standard magic to
 * work around that.  This includes programs compiled with MinGW and Clang
 * for the win32 and win64 targets.
 *
 * NOTE TO USERS OF TDM-GCC: This code is known to break on tdm-gcc 4.9.2. As
 * a workaround, "-D MS_STDLIB_BUGS=0" will at least get it to compile, but
 * Unicode output will still not work.
 */
#  if ( _MSC_VER || __MINGW32__ || __MSVCRT__ )
    /* This code is being compiled either on MS Visual C++, or MinGW, or
     * clang++ in compatibility mode for either, or is being linked to the
     * msvcrt (Microsoft Visual C RunTime) library.
     */
#    define MS_STDLIB_BUGS 1
#  else
#    define MS_STDLIB_BUGS 0
#  endif
#endif

#if MS_STDLIB_BUGS
#  include <io.h>
#  include <fcntl.h>
#endif

using std::endl;
using std::istream;
using std::wcin;
using std::wcout;

void init_locale(void)
// Does magic so that wcout can work.
{
#if MS_STDLIB_BUGS
  // Windows needs a little non-standard magic.
  constexpr char cp_utf16le[] = ".1200";
  setlocale( LC_ALL, cp_utf16le );
  _setmode( _fileno(stdout), _O_WTEXT );
  _setmode( _fileno(stdin), _O_WTEXT );
#else
  // The correct locale name may vary by OS, e.g., "en_US.utf8".
  constexpr char locale_name[] = "";
  setlocale( LC_ALL, locale_name );
  std::locale::global(std::locale(locale_name));
  wcout.imbue(std::locale());
  wcin.imbue(std::locale());
#endif
}

int main(void)
{
  init_locale();

  static constexpr size_t bufsize = 1024;
  std::wstring input;
  input.reserve(bufsize);

  while ( wcin >> input )
    wcout << input << endl;

  return EXIT_SUCCESS;
}

This reads in wide-character input from the console regardless of its initial locale or code page. If what you meant instead was that the input will be bytes in the UTF-8 encoding (such as from a redirected file in UTF-8 encoding), not console input, the standard way to accomplish this is supposed to be the conversion facet from UTF-8 to wchar_t in <codecvt> and <locale>, but in practice Windows doesn’t support Unicode locales, so you have to read the bytes in and then convert them manually. A more standard way to do that is mbstowcs(). I have some old code to do the conversion for STL iterators, but there are also conversion functions in the standard library. You might need to do this anyway, if for example you need to save or transmit in UTF-8.
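
As a rough illustration of "read the bytes in and then convert them manually", here is a minimal sketch of a hand-written UTF-8 decoder. The name `decode_utf8` is hypothetical, and producing `char32_t` code points instead of `wchar_t` is an assumption made so the example does not depend on the current locale or on the width of `wchar_t`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into Unicode code points.
// Throws std::runtime_error on truncated or malformed sequences.
std::u32string decode_utf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte determines the sequence length and the initial bits.
        if (lead < 0x80)              { cp = lead;        len = 1; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; len = 2; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; len = 3; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; len = 4; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > bytes.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte has the form 10xxxxxx and adds six bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }

        // Reject overlong encodings, surrogates, and out-of-range values.
        static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_for_len[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid UTF-8 code point");

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Fed the ten UTF-8 bytes of "zażółć", this yields six code points. Converting those to `wchar_t` is a plain copy where `wchar_t` is 32 bits wide; on Windows, code points above U+FFFF would additionally need to be split into UTF-16 surrogate pairs.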

There are some who will recommend you store all strings in UTF-8 internally even when using an API like Windows based on some form of UTF-16, converting to another encoding only when you make API calls. I strongly advise you to use UTF-8 externally whenever you possibly can, but I don’t go quite that far. Note, however, that storing strings as UTF-8 will save you a lot of memory, especially on systems where wchar_t is UTF-32. You would have a better idea than I how many bytes this would typically save you for Polish text.
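
To put a number on the memory point for a given piece of text, something like the following sketch could be used. `encode_utf8`, `EncodedSizes`, and `encoded_sizes` are hypothetical names introduced for illustration, and the encoder assumes its input contains only valid Unicode scalar values.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encodes Unicode code points as UTF-8.
// Assumes every value is a valid scalar value (no surrogates, <= 0x10FFFF).
std::string encode_utf8(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Storage in bytes for the same text in each encoding form.
struct EncodedSizes { std::size_t utf8, utf16, utf32; };

EncodedSizes encoded_sizes(const std::u32string& text)
{
    EncodedSizes s{0, 0, 0};
    s.utf8 = encode_utf8(text).size();
    for (char32_t cp : text)
        s.utf16 += (cp >= 0x10000) ? 4 : 2;  // surrogate pair above the BMP
    s.utf32 = text.size() * 4;
    return s;
}
```

For the six characters of "zażółć" this gives 10 bytes in UTF-8, 12 in UTF-16, and 24 in UTF-32, since each of the four Polish letters takes two bytes in UTF-8.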

Davislor answered Oct 17 '22 15:10


Here is the trick I use for UTF-8 support. The result is a multibyte (UTF-8) string that can then be used elsewhere:

#include <cstdio>
#include <windows.h>
#define MAX_INPUT_LENGTH 255

int main()
{

    SetConsoleOutputCP(CP_UTF8);
    SetConsoleCP(CP_UTF8);

    wchar_t wstr[MAX_INPUT_LENGTH];
    char mb_str[MAX_INPUT_LENGTH * 3 + 1];

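    DWORD read = 0;
    HANDLE con = GetStdHandle(STD_INPUT_HANDLE);

    // Call the wide-character variant explicitly: the ReadConsole macro maps
    // to ReadConsoleA unless UNICODE is defined, which would not fill wstr
    // with UTF-16 code units.
    if (!ReadConsoleW(con, wstr, MAX_INPUT_LENGTH, &read, NULL))
        return 1;

    // Convert the UTF-16 input to UTF-8 and terminate the result.
    int size = WideCharToMultiByte(CP_UTF8, 0, wstr, read, mb_str, sizeof(mb_str) - 1, NULL, NULL);
    mb_str[size] = 0;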

    std::printf("ENTERED: %s\n", mb_str);

    return 0;
}
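
One detail to be aware of: in the console's default line-input mode, the text handed back by the read includes the line terminator, so the converted string typically ends in "\r\n". A minimal sketch for trimming it, assuming the UTF-8 result has been copied into a `std::string` (`strip_line_ending` is a hypothetical helper name, not part of the code above):

```cpp
#include <string>

// Hypothetical helper: removes any trailing '\r' and '\n' bytes in place.
// Safe on UTF-8 because neither byte can occur inside a multibyte sequence.
void strip_line_ending(std::string& line)
{
    while (!line.empty() && (line.back() == '\r' || line.back() == '\n'))
        line.pop_back();
}
```

With the program above, this would be applied to something like `std::string line(mb_str, size);` before the string is compared or stored.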


P.S. Big thanks to Remy Lebeau for pointing out some flaws!

Killzone Kid answered Oct 17 '22 17:10