How do you cope with signed char -> int issues with standard library?

This is a really long-standing issue in my work, that I realize I still don't have a good solution to...

C naively defined all of its character test functions for an int:

int isspace(int ch);

But chars are often signed, and a full character often doesn't fit in an int, or in any single storage unit that is used for strings**.

And these functions have been the logical template for current C++ functions and methods, and have set the stage for the current standard library. In fact, they're still supported, afaict.

So if you call isspace(*pchar) on a platform where char is signed, any byte above 127 is sign-extended to a negative int, which these functions are not required to handle. These problems are hard to see, and hence hard to guard against, in my experience.
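For the sign-extension half of the problem, the usual guard is to convert to unsigned char before the call. Here is a minimal sketch of that; the helper names is_space_byte and count_leading_space are made up for illustration, and it only addresses single bytes, not multi-byte characters:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: wraps std::isspace so a negative value is never passed.
inline bool is_space_byte(char c) {
    // Converting to unsigned char first keeps the argument in 0..UCHAR_MAX,
    // the range the <cctype> functions are defined for (besides EOF).
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical helper: counts leading whitespace bytes in a narrow string.
inline std::size_t count_leading_space(const std::string& s) {
    std::size_t n = 0;
    while (n < s.size() && is_space_byte(s[n])) ++n;
    return n;
}
```

Routing every call through one wrapper like this means the cast cannot be forgotten at individual call sites.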

Similarly, isspace() and its ilk all take ints, yet the actual width of a character is often unknown without analyzing the string. That means any modern character library should essentially never be carting around chars or wchar_ts, but only pointers/iterators, since only by analyzing the character stream can you know how much of it composes a single logical character. So I am at a bit of a loss as to how best to approach these issues.

I keep expecting a genuinely robust library based around abstracting away the size-factor of any character, and working only with strings (providing such things as isspace, etc.), but either I've missed it, or there's another simpler solution staring me in the face that all of you (who know what you're doing) use...

** These issues don't come up for fixed-sized character-encodings that can wholly contain a full character - UTF-32 apparently is about the only option that has these characteristics (or specialized environments that restrict themselves to ASCII or some such).

So, my question is:

How do you test for whitespace, isprintable, etc., in a way that doesn't suffer from two issues:

1) Sign extension, and
2) Variable-width character issues?

After all, most character encodings are variable-width: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS. Even extended ASCII can have the simple sign-extension problem if the compiler treats char as a signed 8-bit unit.
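To make the variable-width point concrete, here is a small sketch that counts code points in a UTF-8 string and shows that the result differs from the byte count. The function name utf8_code_points is made up for illustration, and the sketch assumes the input is already valid UTF-8:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string assumed to hold valid UTF-8.
inline std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char byte : s) {
        // Continuation bytes look like 10xxxxxx; any other byte starts a code point.
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For example, the UTF-8 encoding of "naïve" occupies 6 bytes but holds 5 code points, so s.size() and the logical length disagree. Note that this counts code points only; a user-perceived character can still span several code points.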

Please note:

No matter what size your char_type is, it's wrong for most character encoding schemes.

This problem is in the standard C library, as well as in the C++ standard libraries, which still try to pass around char and wchar_t, rather than string iterators, in the various isspace, isprint, etc. implementations.

Actually, it's precisely those types of functions that break the genericity of std::string. If it only worked in storage units, and didn't try to pretend to understand the meaning of the storage units as logical characters (such as isspace), then the abstraction would be much more honest, and would force us programmers to look elsewhere for valid solutions...

Thank You

Thanks to everyone who participated. Between this discussion and "WChars, Encodings, Standards and Portability" I have a much better handle on the issues. Although there are no easy answers, every bit of understanding helps.

asked Nov 10 '11 16:11

Mordachai

1 Answer

How do you test for whitespace, isprintable, etc., in a way that doesn't suffer from two issues:
1) Sign extension
2) variable-width character issues
After all, all commonly used Unicode encodings are variable-width, whether programmers realize it or not: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS...

Obviously, you have to use a Unicode-aware library, since you've demonstrated (correctly) that the C++03 standard library is not. The C++11 library is improved, but still not quite good enough for most usages. Yes, some OSes have a 32-bit wchar_t, which makes them able to handle UTF-32 correctly, but that's an implementation detail that is not guaranteed by C++, and it is not remotely sufficient for many Unicode tasks, such as iterating over graphemes (letters).

IBM ICU
Libiconv
microUTF-8
UTF-8 CPP, version 1.0
utfproc
and many more at http://unicode.org/resources/libraries.html.
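To show the general shape of what such libraries do (this is not the API of any of them), here is a rough sketch that works on iterators rather than on single chars: it decodes one UTF-8 code point at a time and tests it against the Unicode White_Space code points. All three function names are hypothetical, the decoder does not reject overlong forms or surrogates, and a real library should be preferred for production code:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes the code point starting at `it` (which must not
// equal `end`) and advances `it` past it. Malformed or truncated input yields
// U+FFFD and consumes one byte.
inline char32_t next_code_point(std::string::const_iterator& it,
                                std::string::const_iterator end) {
    const unsigned char lead = static_cast<unsigned char>(*it++);
    int extra = 0;
    char32_t cp = 0;
    if (lead < 0x80)                { return lead; }
    else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else                            { return 0xFFFD; }

    auto p = it;
    for (int i = 0; i < extra; ++i, ++p) {
        if (p == end) return 0xFFFD;
        const unsigned char b = static_cast<unsigned char>(*p);
        if ((b & 0xC0) != 0x80) return 0xFFFD;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    it = p;
    return cp;
}

// Code points carrying the Unicode White_Space property.
inline bool is_white_space(char32_t cp) {
    return (cp >= 0x0009 && cp <= 0x000D) || cp == 0x0020 || cp == 0x0085 ||
           cp == 0x00A0 || cp == 0x1680 || (cp >= 0x2000 && cp <= 0x200A) ||
           cp == 0x2028 || cp == 0x2029 || cp == 0x202F || cp == 0x205F ||
           cp == 0x3000;
}

// Hypothetical helper: counts whitespace code points in a UTF-8 string.
inline std::size_t count_white_space(const std::string& utf8) {
    std::size_t n = 0;
    auto it = utf8.begin();
    while (it != utf8.end()) {
        if (is_white_space(next_code_point(it, utf8.end()))) ++n;
    }
    return n;
}
```

Because the test operates on decoded code points, neither the signedness of char nor the byte width of a character leaks into the caller. Grapheme clusters are still out of scope here, which is one more reason to reach for a full library.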

If the question is less about specific character testing and more about code practices in general: do whatever your framework does. If you're coding for Linux/Qt/networking, keep everything internally in UTF-8. If you're coding for Windows, keep everything internally in UTF-16. If you need to mess with code points, keep everything internally in UTF-32. Otherwise (for portable, generic code), do whatever you want, since no matter what, you have to translate for some OS or other anyway.

answered Nov 07 '22 00:11

Mooing Duck

Tags: c++, c, character-encoding, special-characters