I am used to programming on Windows, but I want to try my hand at making a cross-platform application. I have some questions, if you don't mind:
Question 1
Is there some way to open a Unicode/ASCII file and automatically detect its encoding using bare ANSI C? MSDN says that fopen() can switch between various Unicode formats (UTF-8, UTF-16, Unicode BE/LE) if I use the "ccs=UNICODE" flag. I found experimentally that it does not switch from Unicode to ASCII, but while trying to solve this problem I discovered that Unicode text files have prefixes like 0xFFFE, 0xFEFF, or 0xFEBB.
    FILE *file;
    {
        __int16 isUni;
        file = _tfopen(filename, _T("rb"));
        fread(&(isUni), 1, 2, file);
        fclose(file);
        if (isUni == (__int16)0xFFFE || isUni == (__int16)0xFEFF || isUni == (__int16)0xFEBB)
            file = _tfopen(filename, _T("r,ccs=UNICODE"));
        else
            file = _tfopen(filename, _T("r"));
    }
So, can I make something like this cross-platform and not so ugly?
Question 2
I can do something like this on Windows, but will it work on Linux?
    file = fopen(filename, "r");
    fwscanf(file,"%lf",buffer);
If not, is there some sort of ANSI C function to convert ASCII strings to Unicode? I want to work with Unicode strings in my program.
Question 3
Besides, I need to output Unicode strings to the console. There is setlocale(*) on Windows, but what should I do on Linux? It seems that the console is already Unicode there.
Question 4
Generally speaking, I want to work with Unicode in my program, but I ran into some strange problems:
    f = fopen("inc.txt","rt");
    fwprintf(f,L"Текст"); // converted successfully
    fclose(f);
    f = fopen("inc_u8.txt","rt, ccs = UNICODE");
    fprintf(f,"text"); // failed to convert
    fclose(f);
P.S. Is there a good book about cross-platform programming, something that compares Windows and Linux program code? And a book about practical ways of using Unicode? I don't want to immerse myself in the history of Unicode BE/LE; I am interested in specific C/C++ libraries.
When called, fopen allocates a FILE object on the heap. Note that the contents of a FILE object are undocumented: FILE is an opaque struct, and your code can only use pointers to FILE. The FILE object gets initialized.
fopen does not read any data; it returns the FILE* pointer needed by file I/O functions such as fread, fgets, fprintf, fwrite, fseek, ftell ... fread returns the number of items read (bytes, when the item size is 1), so if the end of file is reached, that number will be smaller than the number requested.
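To illustrate the point about fread's return value, here is a minimal sketch that reads a file in chunks and stops when fread comes back empty. The helper name count_bytes and the 4096-byte chunk size are assumptions made for this example only.

```c
#include <stdio.h>

/* Hypothetical helper: returns the number of bytes in the file,
   or -1 if it cannot be opened or a read error occurs. */
long count_bytes(const char *path)
{
    unsigned char chunk[4096];
    long total = 0;
    size_t got;
    FILE *fp = fopen(path, "rb");   /* opens the file, reads nothing yet */

    if (fp == NULL)
        return -1;

    /* With an item size of 1, fread returns the number of bytes read.
       A value smaller than requested means end of file or an error. */
    while ((got = fread(chunk, 1, sizeof chunk, fp)) > 0)
        total += (long)got;

    if (ferror(fp)) {               /* distinguish an error from plain EOF */
        fclose(fp);
        return -1;
    }
    fclose(fp);
    return total;
}
```

After the loop, ferror (or feof) tells you whether the short read was caused by an error or simply by the end of the file.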
Question 1:
Yes, you can detect the byte order mark, which is the byte sequence you discovered - IF YOUR FILE HAS ONE.
A search on Google and Stack Overflow will do the rest. As for the 'not so ugly' part: you can refactor/beautify your code, e.g. write a function that determines the BOM, call it first, and then call fopen or _tfopen as required. Then you can refactor that again and write your own fopen function. But it will still be ugly.
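A BOM-detecting function of the kind described above might look like the sketch below. It compares raw bytes instead of a 16-bit integer, so the result does not depend on the machine's byte order. The enum and function names are hypothetical, and the sketch only covers the UTF-8 and UTF-16 marks (a UTF-32 LE file, whose BOM starts with the same two bytes as UTF-16 LE, would be misreported).

```c
#include <stdio.h>

/* Hypothetical result codes for this sketch. */
enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE };

/* Classify the first bytes of a buffer by their byte order mark. */
enum bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;            /* EF BB BF */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;        /* FF FE */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;        /* FE FF */
    return BOM_NONE;                /* no mark: plain ASCII, or BOM-less text */
}

/* Read up to three bytes from the start of a file and classify them.
   Returns BOM_NONE if the file cannot be opened. */
enum bom_kind detect_file_bom(const char *path)
{
    unsigned char buf[3];
    size_t got;
    FILE *fp = fopen(path, "rb");   /* binary mode: no newline translation */

    if (fp == NULL)
        return BOM_NONE;
    got = fread(buf, 1, sizeof buf, fp);
    fclose(fp);
    return detect_bom(buf, got);
}
```

A file without a BOM still tells you nothing about its encoding, so BOM_NONE has to be treated as "unknown", not as "ASCII".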
Question 2:
Yes, but the Unicode functions are not always named the same on Linux as they are on Windows.
Use defines. Maybe write your own TCHAR.H.
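As a small sketch of the "use defines" idea: the case-insensitive wide string comparison is called _wcsicmp in the Microsoft C runtime and wcscasecmp on POSIX systems, so a define can hide the difference. The names my_wcsicmp and wide_equal_nocase are made up for this example.

```c
/* Ask for POSIX.1-2008 declarations (wcscasecmp) on non-Windows systems. */
#ifndef _WIN32
#  ifndef _POSIX_C_SOURCE
#    define _POSIX_C_SOURCE 200809L
#  endif
#endif

#include <wchar.h>

/* Hypothetical portability define, in the spirit of TCHAR.H. */
#ifdef _WIN32
#  define my_wcsicmp _wcsicmp       /* Microsoft CRT name */
#else
#  define my_wcsicmp wcscasecmp     /* POSIX name */
#endif

/* Returns 1 if the two wide strings are equal ignoring case, else 0. */
int wide_equal_nocase(const wchar_t *a, const wchar_t *b)
{
    return my_wcsicmp(a, b) == 0;
}
```

The rest of the program then calls my_wcsicmp (or the wrapper) everywhere and never mentions the platform-specific name.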
Question 3:
    #include <locale.h>
    setlocale(LC_ALL, "en_US.UTF-8");
man 3 setlocale
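A hedged sketch of how this is commonly put together: passing an empty string to setlocale asks the C library to take the locale from the environment (LANG, LC_ALL, ...), which on most current Linux systems is a UTF-8 locale. The helper names below are hypothetical, and the fallback to "C" is a choice made for this example.

```c
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: adopt the locale configured in the environment.
   Returns 1 on success, 0 if it had to fall back to the "C" locale. */
int use_environment_locale(void)
{
    if (setlocale(LC_ALL, "") != NULL)
        return 1;
    setlocale(LC_ALL, "C");
    return 0;
}

/* Hypothetical helper: write a wide string and a newline to a stream.
   Returns what fwprintf returns: the number of wide characters written,
   or a negative value on failure. */
int print_wide_line(FILE *out, const wchar_t *text)
{
    return fwprintf(out, L"%ls\n", text);
}
```

A program would call use_environment_locale() once at startup and then print_wide_line(stdout, L"...") as needed. Note that a stream should not mix byte-oriented calls (printf) and wide-oriented calls (wprintf) once its orientation has been set.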
Question 4:
Just use fwprintf.
The other one (the "ccs=UNICODE" mode flag) is a Microsoft extension, not standard C.
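A minimal sketch of the fwprintf route using only standard C follows. Note that the snippet in the question opens its files with "rt" (read mode) and then writes to them; the sketch opens the file with "w". The helper name write_wide_text is an assumption, and how non-ASCII characters are encoded in the file depends on the locale selected with setlocale.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: write one wide string to a new file.
   Returns 0 on success, -1 on failure. */
int write_wide_text(const char *path, const wchar_t *text)
{
    int rc;
    FILE *fp = fopen(path, "w");    /* "w": the stream must be writable */

    if (fp == NULL)
        return -1;

    /* The first wide call makes the stream wide-oriented; the C library
       converts each wchar_t to the multibyte encoding of the current locale. */
    rc = fwprintf(fp, L"%ls", text);

    /* fclose is evaluated first, so the file is always closed. */
    if (fclose(fp) != 0 || rc < 0)
        return -1;
    return 0;
}
```

In the default "C" locale only ASCII is guaranteed to convert, so text such as L"Текст" needs a suitable locale to be set before the call.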
You can use the wxWidgets toolkit.
It uses Unicode, and its classes have implementations of the same thing on Windows, Linux, Unix and Mac.
The better question for you is how do you convert ASCII to Unicode and vice-versa. That goes like this:
    #include <cstdlib>
    #include <string>

    // Note: the buffers hold one byte per character, so this only suits ASCII text.
    std::string Unicode2ASCII( std::wstring wstrStringToConvert )
    {
        size_t sze_StringLength = wstrStringToConvert.length() ;

        if(0 == sze_StringLength)
            return "" ;

        char* chrarry_Buffer = new char[ sze_StringLength + 1 ] ;
        wcstombs( chrarry_Buffer, wstrStringToConvert.c_str(), sze_StringLength ) ; // Unicode2ASCII, const wchar_t* C-String 2 multibyte C-String
        chrarry_Buffer[sze_StringLength] = '\0' ;
        std::string strASCIIstring = chrarry_Buffer ;
        delete[] chrarry_Buffer ; // memory from new[] must be released with delete[]

        return strASCIIstring ;
    }

    std::wstring ASCII2Unicode( std::string strStringToConvert )
    {
        size_t sze_StringLength = strStringToConvert.length() ;

        if(0 == sze_StringLength)
            return L"" ;

        wchar_t* wchrarry_Buffer = new wchar_t[ sze_StringLength + 1 ] ;
        mbstowcs( wchrarry_Buffer, strStringToConvert.c_str(), sze_StringLength ) ; // ASCII2Unicode, const multibyte C-String 2 wchar_t* C-String
        wchrarry_Buffer[sze_StringLength] = L'\0' ;
        std::wstring wstrUnicodeString = wchrarry_Buffer ;
        delete[] wchrarry_Buffer ; // memory from new[] must be released with delete[]

        return wstrUnicodeString ;
    }
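For plain C, the same conversions can be sketched with mbstowcs and wcstombs directly. This version first asks for the required length by passing a null destination, so the buffer is also large enough for non-ASCII text. That length query is specified by POSIX and supported by the Microsoft runtime, but treat it as an assumption on other platforms. The function names mb_to_wide and wide_to_mb are hypothetical.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string (in the current locale's
   encoding) to a newly allocated wide string. The caller frees the result.
   Returns NULL on an invalid sequence or allocation failure. */
wchar_t *mb_to_wide(const char *src)
{
    size_t n = mbstowcs(NULL, src, 0);      /* length without the terminator */
    wchar_t *dst;

    if (n == (size_t)-1)
        return NULL;
    dst = malloc((n + 1) * sizeof *dst);
    if (dst == NULL)
        return NULL;
    mbstowcs(dst, src, n + 1);              /* n + 1 leaves room for L'\0' */
    return dst;
}

/* Hypothetical helper: the reverse direction, with the same conventions. */
char *wide_to_mb(const wchar_t *src)
{
    size_t n = wcstombs(NULL, src, 0);      /* bytes without the terminator */
    char *dst;

    if (n == (size_t)-1)
        return NULL;
    dst = malloc(n + 1);
    if (dst == NULL)
        return NULL;
    wcstombs(dst, src, n + 1);
    return dst;
}
```

Both functions convert according to the locale set with setlocale, so they only deal with UTF-8 input when a UTF-8 locale is active.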
Edit: Here is some insight into the available Unicode functions on Linux (wchar.h):
    __BEGIN_NAMESPACE_STD
    /* Copy SRC to DEST. */
    extern wchar_t *wcscpy (wchar_t *__restrict __dest,
                            __const wchar_t *__restrict __src) __THROW;
    /* Copy no more than N wide-characters of SRC to DEST. */
    extern wchar_t *wcsncpy (wchar_t *__restrict __dest,
                             __const wchar_t *__restrict __src, size_t __n) __THROW;
    /* Append SRC onto DEST. */
    extern wchar_t *wcscat (wchar_t *__restrict __dest,
                            __const wchar_t *__restrict __src) __THROW;
    /* Append no more than N wide-characters of SRC onto DEST. */
    extern wchar_t *wcsncat (wchar_t *__restrict __dest,
                             __const wchar_t *__restrict __src, size_t __n) __THROW;
    /* Compare S1 and S2. */
    extern int wcscmp (__const wchar_t *__s1, __const wchar_t *__s2)
         __THROW __attribute_pure__;
    /* Compare N wide-characters of S1 and S2. */
    extern int wcsncmp (__const wchar_t *__s1, __const wchar_t *__s2, size_t __n)
         __THROW __attribute_pure__;
    __END_NAMESPACE_STD

    #ifdef __USE_XOPEN2K8
    /* Compare S1 and S2, ignoring case. */
    extern int wcscasecmp (__const wchar_t *__s1, __const wchar_t *__s2) __THROW;
    /* Compare no more than N chars of S1 and S2, ignoring case. */
    extern int wcsncasecmp (__const wchar_t *__s1, __const wchar_t *__s2,
                            size_t __n) __THROW;
    /* Similar to the two functions above but take the information from
       the provided locale and not the global locale. */
    # include <xlocale.h>
    extern int wcscasecmp_l (__const wchar_t *__s1, __const wchar_t *__s2,
                             __locale_t __loc) __THROW;
    extern int wcsncasecmp_l (__const wchar_t *__s1, __const wchar_t *__s2,
                              size_t __n, __locale_t __loc) __THROW;
    #endif

    /* Special versions of the functions above which take the locale to
       use as an additional parameter. */
    extern long int wcstol_l (__const wchar_t *__restrict __nptr,
                              wchar_t **__restrict __endptr, int __base,
                              __locale_t __loc) __THROW;
    extern unsigned long int wcstoul_l (__const wchar_t *__restrict __nptr,
                                        wchar_t **__restrict __endptr,
                                        int __base, __locale_t __loc) __THROW;
    __extension__
    extern long long int wcstoll_l (__const wchar_t *__restrict __nptr,
                                    wchar_t **__restrict __endptr,
                                    int __base, __locale_t __loc) __THROW;
    __extension__
    extern unsigned long long int wcstoull_l (__const wchar_t *__restrict __nptr,
                                              wchar_t **__restrict __endptr,
                                              int __base, __locale_t __loc) __THROW;
    extern double wcstod_l (__const wchar_t *__restrict __nptr,
                            wchar_t **__restrict __endptr, __locale_t __loc) __THROW;
    extern float wcstof_l (__const wchar_t *__restrict __nptr,
                           wchar_t **__restrict __endptr, __locale_t __loc) __THROW;
    extern long double wcstold_l (__const wchar_t *__restrict __nptr,
                                  wchar_t **__restrict __endptr,
                                  __locale_t __loc) __THROW;
    /* Copy SRC to DEST, returning the address of the terminating L'\0' in DEST. */
    extern wchar_t *wcpcpy (wchar_t *__restrict __dest,
                            __const wchar_t *__restrict __src) __THROW;
    /* Copy no more than N characters of SRC to DEST, returning the address of
       the last character written into DEST. */
    extern wchar_t *wcpncpy (wchar_t *__restrict __dest,
                             __const wchar_t *__restrict __src, size_t __n) __THROW;
    #endif /* use GNU */

    /* Wide character I/O functions. */
    #ifdef __USE_XOPEN2K8
    /* Like OPEN_MEMSTREAM, but the stream is wide oriented and produces
       a wide character string. */
    extern __FILE *open_wmemstream (wchar_t **__bufloc, size_t *__sizeloc) __THROW;
    #endif

    #if defined __USE_ISOC95 || defined __USE_UNIX98
    __BEGIN_NAMESPACE_STD
    /* Select orientation for stream. */
    extern int fwide (__FILE *__fp, int __mode) __THROW;
    /* Write formatted output to STREAM.
       This function is a possible cancellation point and therefore not
       marked with __THROW. */
    extern int fwprintf (__FILE *__restrict __stream,
                         __const wchar_t *__restrict __format, ...)
         /* __attribute__ ((__format__ (__wprintf__, 2, 3))) */;
    /* Write formatted output to stdout.
       This function is a possible cancellation point and therefore not
       marked with __THROW. */
    extern int wprintf (__const wchar_t *__restrict __format, ...)
         /* __attribute__ ((__format__ (__wprintf__, 1, 2))) */;
    /* Write formatted output of at most N characters to S. */
    extern int swprintf (wchar_t *__restrict __s, size_t __n,
                         __const wchar_t *__restrict __format, ...)
         __THROW /* __attribute__ ((__format__ (__wprintf__, 3, 4))) */;
    /* Write formatted output to S from argument list ARG.
       This function is a possible cancellation point and therefore not
       marked with __THROW. */
    extern int vfwprintf (__FILE *__restrict __s,
                          __const wchar_t *__restrict __format,
                          __gnuc_va_list __arg)
         /* __attribute__ ((__format__ (__wprintf__, 2, 0))) */;
    /* Write formatted output to stdout from argument list ARG.
       This function is a possible cancellation point and therefore not
       marked with __THROW. */
    extern int vwprintf (__const wchar_t *__restrict __format,
                         __gnuc_va_list __arg)
         /* __attribute__ ((__format__ (__wprintf__, 1, 0))) */;
    /* Write formatted output of at most N character to S from argument
       list ARG. */
    extern int vswprintf (wchar_t *__restrict __s, size_t __n,
                          __const wchar_t *__restrict __format,
                          __gnuc_va_list __arg)
         __THROW /* __attribute__ ((__format__ (__wprintf__, 3, 0))) */;
    /* Read formatted input from STREAM.
       This function is a possible cancellation point and therefore not
       marked with __THROW. */
    extern int fwscanf (__FILE *__restrict __stream,
                        __const wchar_t *__restrict __format, ...)
         /* __attribute__ ((__format__ (__wscanf__, 2, 3))) */;
    /* Read formatted input from stdin.
       This function is a possible cancellation point and therefore not
       marked with __THROW. */
    extern int wscanf (__const wchar_t *__restrict __format, ...)
         /* __attribute__ ((__format__ (__wscanf__, 1, 2))) */;
    /* Read formatted input from S. */
    extern int swscanf (__const wchar_t *__restrict __s,
                        __const wchar_t *__restrict __format, ...)
         __THROW /* __attribute__ ((__format__ (__wscanf__, 2, 3))) */;

    # if defined __USE_ISOC99 && !defined __USE_GNU \
         && (!defined __LDBL_COMPAT || !defined __REDIRECT) \
         && (defined __STRICT_ANSI__ || defined __USE_XOPEN2K)
    #  ifdef __REDIRECT
    /* For strict ISO C99 or POSIX compliance disallow %as, %aS and %a[
       GNU extension which conflicts with valid %a followed by letter
       s, S or [. */
    extern int __REDIRECT (fwscanf, (__FILE *__restrict __stream,
                                     __const wchar_t *__restrict __format, ...),
                           __isoc99_fwscanf)
         /* __attribute__ ((__format__ (__wscanf__, 2, 3))) */;
    extern int __REDIRECT (wscanf, (__const wchar_t *__restrict __format, ...),
                           __isoc99_wscanf)
         /* __attribute__ ((__format__ (__wscanf__, 1, 2))) */;
    extern int __REDIRECT_NTH (swscanf, (__const wchar_t *__restrict __s,
                                         __const wchar_t *__restrict __format,
                                         ...), __isoc99_swscanf)
         /* __attribute__ ((__format__ (__wscanf__, 2, 3))) */;
    #  else
    extern int __isoc99_fwscanf (__FILE *__restrict __stream,
                                 __const wchar_t *__restrict __format, ...);
    extern int __isoc99_wscanf (__const wchar_t *__restrict __format, ...);
    extern int __isoc99_swscanf (__const wchar_t *__restrict __s,
                                 __const wchar_t *__restrict __format, ...)