How can I inspect a file's byte order mark (BOM) in C++ to determine whether the file is UTF-8?
This is my version in C++:
#include <fstream>

/* Reads a leading UTF-8 BOM from the file stream if it exists.
 * Returns true iff a BOM was there. On success the BOM is consumed;
 * otherwise the bytes read so far are put back. */
bool ReadBOM(std::ifstream & is)
{
    /* Read the first byte. */
    char const c0 = is.get();
    if (c0 != '\xEF') {
        is.putback(c0);
        return false;
    }
    /* Read the second byte. */
    char const c1 = is.get();
    if (c1 != '\xBB') {
        is.putback(c1);
        is.putback(c0);
        return false;
    }
    /* Peek at the third byte. */
    char const c2 = is.peek();
    if (c2 != '\xBF') {
        is.putback(c1);
        is.putback(c0);
        return false;
    }
    is.get(); // Consume the third byte so the stream is positioned after the BOM.
    return true; // This file contains a BOM for UTF-8.
}
if (buffer[0] == '\xEF' && buffer[1] == '\xBB' && buffer[2] == '\xBF') {
    // UTF-8
}
It's better to use buffer[0] == '\xEF' instead of buffer[0] == 0xEF in order to avoid signed/unsigned char problems; see "How do I represent negative char values in hexadecimal?"
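To illustrate the signed/unsigned point: on a platform where plain char is signed and 8 bits wide, a byte 0xEF stored in a char has the value -17, so comparing it with the int 0xEF (239) is never true. Another way around this, sketched below, is to cast each byte to unsigned char before comparing. The helper name StartsWithUtf8Bom is made up for this example.

```cpp
#include <cstddef>

// Hypothetical helper: true if the first three bytes of `data` are the
// UTF-8 BOM. Casting to unsigned char makes the comparison against
// 0xEF, 0xBB, 0xBF independent of whether plain char is signed.
bool StartsWithUtf8Bom(const char* data, std::size_t size)
{
    if (size < 3) return false;
    return static_cast<unsigned char>(data[0]) == 0xEF &&
           static_cast<unsigned char>(data[1]) == 0xBB &&
           static_cast<unsigned char>(data[2]) == 0xBF;
}
```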
In general, you can't.
The presence of a Byte Order Mark is a very strong indication that the file you are reading is Unicode. If you are expecting a text file, and the first four bytes you receive are:
0x00, 0x00, 0xfe, 0xff -- The file is almost certainly UTF-32BE
0xff, 0xfe, 0x00, 0x00 -- The file is almost certainly UTF-32LE
0xfe, 0xff, XX, XX -- The file is almost certainly UTF-16BE
0xff, 0xfe, XX, XX (but not 00, 00) -- The file is almost certainly UTF-16LE
0xef, 0xbb, 0xbf, XX -- The file is almost certainly UTF-8 with a BOM
But what about anything else? If the bytes you get are anything other than one of these five patterns, then you can't say for certain that your file is or is not UTF-8.
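The five patterns can be checked in code roughly as follows. This is only a sketch: the enum Bom and the function DetectBom are names invented here, and the caller is assumed to have already read up to four bytes from the start of the file in binary mode. The UTF-32LE test has to come before the UTF-16LE test, because both start with 0xff, 0xfe.

```cpp
#include <cstddef>

// Hypothetical names for this sketch.
enum class Bom { None, Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

// Classifies the first `n` bytes at `p` by the BOM they start with.
// Bom::None means "no BOM seen", not "the file is not Unicode".
Bom DetectBom(const unsigned char* p, std::size_t n)
{
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return Bom::Utf32BE;
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return Bom::Utf32LE;  // Must be tested before UTF-16LE.
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return Bom::Utf8;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return Bom::Utf16BE;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return Bom::Utf16LE;
    return Bom::None;
}
```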
In fact, any text document containing only ASCII characters from 0x00 to 0x7f is a valid UTF-8 document, as well as being a plain ASCII document.
There are heuristics that can try to infer, based on the particular characters that are seen, whether a document is encoded in, say, ISO-8859-1, or UTF-8, or CP1252, but in general, the first two, three, or four bytes of a file are not enough to say whether what you are looking at is definitely UTF-8.
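One simple heuristic of this kind is to scan the whole buffer and check that every byte sequence has the structure UTF-8 requires. The sketch below is deliberately simplified and makes its own assumptions: the name LooksLikeUtf8 is invented, and the check does not reject every overlong encoding or surrogate code point. A pure-ASCII buffer passes, which matches the point that ASCII text is also valid UTF-8.

```cpp
#include <cstddef>

// Hypothetical helper: returns true if the buffer is structurally valid
// UTF-8. Simplified: it checks lead/continuation byte structure and rejects
// the lead bytes 0xC0, 0xC1 and 0xF5..0xFF, but it does not reject every
// overlong form or UTF-16 surrogate code points.
bool LooksLikeUtf8(const unsigned char* p, std::size_t n)
{
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = p[i];
        std::size_t extra;                            // continuation bytes expected
        if (b <= 0x7F)                   extra = 0;   // ASCII
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;   // 2-byte sequence
        else if (b >= 0xE0 && b <= 0xEF) extra = 2;   // 3-byte sequence
        else if (b >= 0xF0 && b <= 0xF4) extra = 3;   // 4-byte sequence
        else return false;  // stray continuation byte or invalid lead byte
        if (n - i - 1 < extra) return false;          // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            if ((p[i + k] & 0xC0) != 0x80) return false;  // not 10xxxxxx
        }
        i += extra + 1;
    }
    return true;
}
```

A true result only means the bytes are consistent with UTF-8; text in ISO-8859-1 or CP1252 that happens to contain only ASCII characters passes as well.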
The UTF-8 BOM is the byte sequence 0xEF, 0xBB, 0xBF; its ordering doesn't depend on endianness.
How you read the file with C++ is up to you. Personally, I still use C-style FILE functions because they are provided by the library I am coding with, and I can be sure to specify binary mode and avoid unintended translations down the line.
adapted from cs.vt.edu
#include <fstream>
...
char buffer[100];
std::ifstream myFile("data.bin", std::ios::in | std::ios::binary);
myFile.read(buffer, 3);
if (!myFile) {
    // An error occurred!
    // myFile.gcount() returns the number of bytes read.
    // Calling myFile.clear() will reset the stream state
    // so it is usable again.
}
...
if (!myFile.read(buffer, 100)) {
    // Same effect as above
}
// Compare against char literals: where char is signed,
// buffer[0] == 0xEF is never true.
if (buffer[0] == '\xEF' && buffer[1] == '\xBB' && buffer[2] == '\xBF') {
    // Congrats, UTF-8
}
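For comparison, the C-style approach mentioned above could look like the following sketch. The function name FileHasUtf8Bom is made up for illustration; it uses only the standard fopen, fread and fclose functions, and it treats a file that cannot be opened the same as a file without a BOM, which may or may not suit your error handling.

```cpp
#include <cstdio>

// Hypothetical helper: returns true if the file at `path` starts with the
// UTF-8 BOM. Returns false if the file cannot be opened or is shorter than
// three bytes.
bool FileHasUtf8Bom(const char* path)
{
    std::FILE* f = std::fopen(path, "rb");  // "b": binary mode, no translation
    if (!f) return false;
    unsigned char bom[3] = {0, 0, 0};
    const std::size_t got = std::fread(bom, 1, sizeof bom, f);
    std::fclose(f);
    return got == 3 && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF;
}
```

Reading into an unsigned char buffer sidesteps the signed char comparison issue entirely.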
Alternatively, many formats use UTF-8 by default if no other BOM (UTF-16 or UTF-32, for example) is present.
See also: the Wikipedia article on the byte order mark and the unicode.org FAQ.