
C++ How to inspect file Byte Order Mark in order to get if it is UTF-8?

How can I inspect a file's Byte Order Mark in C++ to determine whether the file is UTF-8?

myWallJSON Avatar asked Feb 01 '12 21:02



4 Answers

This is my version in C++:

#include <fstream>

/* Consumes a leading UTF-8 BOM (EF BB BF) from the file stream if it exists.
 * Returns true iff the BOM has been there; otherwise the stream is rewound
 * to where it started. Open the stream in binary mode. */
bool ReadBOM(std::ifstream & is)
{
  /* Remember the starting position so we can rewind if there is no BOM. */
  std::ifstream::pos_type const start = is.tellg();

  /* Read the first three bytes in one go. */
  char bom[3] = {0, 0, 0};
  is.read(bom, 3);
  if (is.gcount() == 3 &&
      bom[0] == '\xEF' && bom[1] == '\xBB' && bom[2] == '\xBF') {
    return true; // This file contains a BOM for UTF-8.
  }

  /* No BOM, or fewer than three bytes: clear eofbit/failbit and rewind. */
  is.clear();
  is.seekg(start);
  return false;
}
ManuelAtWork Avatar answered Oct 27 '22 00:10



if (buffer[0] == '\xEF' && buffer[1] == '\xBB' && buffer[2] == '\xBF') {
    // UTF-8
}

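It's better to use buffer[0] == '\xEF' instead of buffer[0] == 0xEF in order to avoid signed/unsigned char problems: where plain char is signed, the byte 0xEF is stored as -17, which never compares equal to the int 0xEF (239). See "How do I represent negative char values in hexadecimal?"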
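Another way to sidestep the signedness issue is to cast each byte to unsigned char before comparing. The following is a minimal sketch; the helper name starts_with_utf8_bom is hypothetical and not part of the answer above.

```cpp
#include <cstddef>

// Hypothetical helper: true if the first three bytes are the UTF-8 BOM.
// Casting to unsigned char makes the comparison independent of whether
// plain char is signed on the target platform.
bool starts_with_utf8_bom(const char* buffer, std::size_t size)
{
    if (size < 3) {
        return false;  // too short to hold a BOM
    }
    return static_cast<unsigned char>(buffer[0]) == 0xEF &&
           static_cast<unsigned char>(buffer[1]) == 0xBB &&
           static_cast<unsigned char>(buffer[2]) == 0xBF;
}
```

The explicit size parameter also guards against reading past the end of a buffer shorter than three bytes.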

user2622198 Avatar answered Oct 27 '22 00:10



In general, you can't.

The presence of a Byte Order Mark is a very strong indication that the file you are reading is Unicode. If you are expecting a text file, and the first four bytes you receive are:

0x00, 0x00, 0xfe, 0xff -- The file is almost certainly UTF-32BE
0xff, 0xfe, 0x00, 0x00 -- The file is almost certainly UTF-32LE
0xfe, 0xff,   XX,   XX -- The file is almost certainly UTF-16BE
0xff, 0xfe,   XX,   XX (but not 00, 00) -- The file is almost certainly UTF-16LE
0xef, 0xbb, 0xbf,   XX -- The file is almost certainly UTF-8 with a BOM

But what about anything else? If the bytes you get are anything other than one of these five patterns, then you can't say for certain that your file is or is not UTF-8.
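The five patterns above could be checked with something like the sketch below. The Bom enum and the detect_bom function are hypothetical names introduced for illustration. Note that the UTF-32LE test has to come before the UTF-16LE test, because FF FE 00 00 also begins with the UTF-16LE mark.

```cpp
#include <cstddef>

// Hypothetical result type: which BOM, if any, starts the buffer.
enum class Bom { None, Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

// Classifies the first bytes of a buffer against the five BOM patterns.
// Bom::None means "no BOM seen", not "this is not Unicode".
Bom detect_bom(const unsigned char* p, std::size_t n)
{
    // UTF-32 first: FF FE 00 00 would otherwise match UTF-16LE.
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return Bom::Utf32BE;
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return Bom::Utf32LE;
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return Bom::Utf8;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return Bom::Utf16BE;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return Bom::Utf16LE;
    return Bom::None;
}
```

A result of Bom::None leaves the encoding undetermined: the file may still be UTF-8 without a BOM, plain ASCII, or something else entirely.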

In fact, any text document containing only ASCII characters from 0x00 to 0x7f is a valid UTF-8 document, as well as being a plain ASCII document.

There are heuristics that can try to infer, based on the particular characters that are seen, whether a document is encoded in, say, ISO-8859-1, or UTF-8, or CP1252, but in general, the first two, three, or four bytes of a file are not enough to say whether what you are looking at is definitely UTF-8.
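One simple heuristic of this kind is to check whether the whole byte sequence is well-formed UTF-8. The sketch below is only an illustration under that assumption; is_valid_utf8 is a hypothetical name. Passing the check makes UTF-8 plausible but does not prove it, since pure ASCII text passes too.

```cpp
#include <cstddef>

// Hypothetical helper: true if the n bytes at p form well-formed UTF-8.
// Rejects stray continuation bytes, truncated sequences, overlong forms,
// UTF-16 surrogates, and code points above U+10FFFF.
bool is_valid_utf8(const unsigned char* p, std::size_t n)
{
    // Smallest code point that legitimately needs a sequence of each length.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = p[i];
        std::size_t len = 0;   // total bytes in this encoded character
        unsigned long cp = 0;  // decoded code point
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;     // continuation byte or invalid lead byte

        if (len > n - i) return false;  // sequence runs past the end
        for (std::size_t k = 1; k < len; ++k) {
            if ((p[i + k] & 0xC0) != 0x80) return false;  // not 10xxxxxx
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }
        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // out of range
        i += len;
    }
    return true;
}
```

Text in ISO-8859-1 or CP1252 that contains accented characters will usually fail this check, because a lone byte such as 0xE9 is not a complete UTF-8 sequence.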

Ian Clelland Avatar answered Oct 26 '22 23:10



The UTF-8 BOM is the byte sequence

0xEF,0xBB,0xBF

Its ordering doesn't depend on endianness.

How you read the file in C++ is up to you. Personally, I still use C-style file functions because they are provided by the library I am coding with, and I can be sure to specify binary mode and avoid unintended translations down the line.

adapted from cs.vt.edu

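#include <fstream>
...
char buffer[100];
std::ifstream myFile ("data.bin", std::ios::in | std::ios::binary);
myFile.read (buffer, 3);
if (!myFile) {
    // An error occurred (for example, the file is shorter than 3 bytes).
    // myFile.gcount() returns the number of bytes read.
    // Calling myFile.clear() will reset the stream state
    // so it is usable again.
}
// Check the BOM before reading anything else into buffer.
// Compare against char literals such as '\xEF': where char is signed,
// buffer[0] == 0xEF is never true.
if (myFile.gcount() == 3 &&
    buffer[0] == '\xEF' && buffer[1] == '\xBB' && buffer[2] == '\xBF') {
    // Congrats, UTF-8
}
...
if (!myFile.read (buffer, 100)) {
    // Same effect as above
}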
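The C-style approach mentioned earlier in this answer might look like the sketch below. It uses std::fopen with the "rb" mode so that no newline translation happens; the function name file_has_utf8_bom is hypothetical.

```cpp
#include <cstdio>

// Hypothetical helper: true if the file at `path` starts with the UTF-8 BOM.
// Returns false if the file cannot be opened or is shorter than three bytes.
bool file_has_utf8_bom(const char* path)
{
    std::FILE* f = std::fopen(path, "rb");  // "b": binary, no translation
    if (!f) {
        return false;
    }
    unsigned char bom[3];
    const std::size_t got = std::fread(bom, 1, 3, f);
    std::fclose(f);
    // unsigned char avoids the signed-char comparison problem entirely.
    return got == 3 && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF;
}
```

Because the file is closed before returning, a caller that goes on to parse the contents would reopen it and skip the first three bytes when this returns true.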

Alternatively, many formats use UTF-8 by default if no other BOM (UTF-16 or UTF-32, for example) is present.

wiki for BOM

unicode.org.faq

John Avatar answered Oct 26 '22 23:10
