
Ignore byte-order marks in C++, reading from a stream

Tags: c++, unicode

I have a function to read the value of one variable (integer, double, or boolean) on a single line in an ifstream:

template <typename Type>
void readFromFile (ifstream &in, Type &val)
{
  string str;
  getline (in, str);
  stringstream ss(str);
  ss >> val;
}

However, it fails on text files created with editors inserting a BOM (byte order mark) at the beginning of the first line, which unfortunately includes {Note,Word}pad. How can I modify this function to ignore the byte-order mark if present at the beginning of str?


asked Jan 16 '12 13:01

F'x

2 Answers

(I'm assuming you're on Windows, since using U+FEFF as a signature in UTF-8 files is mostly a Windows thing and should simply be avoided elsewhere.)

You could open the file as a UTF-8 file and then check whether the first character is U+FEFF. You can do this by opening a normal char-based fstream and then using wbuffer_convert to treat it as a series of code units in another encoding. VS2010 doesn't yet have great support for char32_t, so the following uses UTF-16 in wchar_t.

std::fstream fs(filename);
std::wbuffer_convert<std::codecvt_utf8_utf16<wchar_t>,wchar_t> wb(fs.rdbuf());
std::wistream is(&wb);
// If these objects are not on the stack, remember to destroy them in reverse order of creation: is, then wb, then fs.
std::wistream::int_type ch = is.get();
const std::wistream::int_type ZERO_WIDTH_NO_BREAK_SPACE = 0xFEFF;
if(ZERO_WIDTH_NO_BREAK_SPACE != ch)
    is.putback(ch);

// now the stream can be passed around and used without worrying about the extra character in the stream.

int i;
readFromStream<int>(is,i);

Remember that this should be done on the file stream as a whole, not inside readFromFile on your stringstream, because ignoring U+FEFF should only be done if it's the very first character in the whole file, if at all. It shouldn't be done anywhere else.
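The snippet above calls readFromStream, which is not defined in the question (readFromFile there takes an ifstream). A minimal sketch of what such a function could look like follows, assuming it is simply the question's reader generalized over the character type. The name and the template parameter order are assumptions, chosen so that a call like readFromStream<int>(is, i) compiles.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical generalization of the question's readFromFile.
// Type comes first so it can be given explicitly (readFromStream<int>(is, i));
// CharT and Traits are deduced from the stream argument.
template <typename Type, typename CharT, typename Traits>
void readFromStream(std::basic_istream<CharT, Traits>& in, Type& val)
{
    std::basic_string<CharT, Traits> str;
    std::getline(in, str);                            // one value per line
    std::basic_istringstream<CharT, Traits> ss(str);  // parse only that line
    ss >> val;
}
```

Because no BOM handling happens inside this function, it can be called for every line; the check for U+FEFF stays a one-time step performed right after the file is opened.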

On the other hand, if you're happy using a char-based stream and just want to skip U+FEFF if present, then James Kanze's suggestion seems good, so here's an implementation:

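std::ifstream fs(filename);  // ifstream, so it can be passed to readFromFile
char a,b,c;
a = fs.get();
b = fs.get();
c = fs.get();
if (a != (char)0xEF || b != (char)0xBB || c != (char)0xBF) {
    fs.clear();   // a file shorter than three bytes leaves eofbit/failbit set
    fs.seekg(0);  // no signature: rewind so the first bytes are read as data
} else {
    std::cerr << "Warning: file contains the so-called 'UTF-8 signature'\n";
}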

Additionally, if you want to use wchar_t internally, the codecvt_utf8_utf16 and codecvt_utf8 facets have a mode that can consume 'BOMs' for you. The only problem is that wchar_t is widely recognized to be worthless these days* and so you probably shouldn't do this.

std::wifstream fin(filename);
fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf8_utf16<wchar_t, 0x10FFFF, std::consume_header));

* wchar_t is worthless because it is specified to do just one thing: provide a fixed-size data type that can represent any code point in a locale's character repertoire. It does not provide a common representation between locales (i.e., the same wchar_t value can be different characters in different locales, so you cannot necessarily convert to wchar_t, switch to another locale, and then convert back to char in order to do iconv-like encoding conversions).

The fixed-size representation itself is worthless for two reasons. First, many code points have semantic meanings, and so understanding text means you have to process multiple code points anyway. Second, some platforms such as Windows use UTF-16 as the wchar_t encoding, which means a single wchar_t isn't even necessarily a code point value. (Whether using UTF-16 this way is even conformant to the standard is ambiguous. The standard requires that every character supported by a locale be representable as a single wchar_t value; if no locale supports any character outside the BMP, then UTF-16 could be seen as conformant.)


answered Nov 09 '22 23:11

bames53

You have to start by reading the first byte or two of the stream, and deciding whether it is part of a BOM or not. It's a bit of a pain, since you can only putback a single byte, whereas you typically will want to read four. The simplest solution is to open the file, read the initial bytes, memorize how many you need to skip, then seek back to the beginning and skip them.

answered Nov 10 '22 00:11

James Kanze
