Templated string class use of strcmp, strcpy and strlen

Question

I overheard a discussion some time ago about how, when creating a templated string class that can hold UTF-8 and UTF-16, you should not use strcmp, strcpy and strlen. From what I recall, you are supposed to use functions from the <algorithm> header instead; however, I do not remember what the implementation looks like, or why this is so. Could someone please explain which functions to use instead, how to use them, and why?

The example of the templated string class would be something such as

String<UTF8> utf8String;
String<UTF16> utf16String;

Here UTF8 is an unsigned char and UTF16 is an unsigned short.

bames53 · Accepted Answer

First off, C++ has no need of additional string classes. There are probably already hundreds or thousands too many string classes that have been developed, and yours won't improve the situation. Unless you're doing this purely for your edification, you should think long and hard and then decide not to write a new one.

You can use std::basic_string<char> to hold UTF-8 code unit sequences, std::basic_string<char16_t> to hold UTF-16 code unit sequences, std::basic_string<char32_t> to hold UTF-32 code unit sequences, etc. C++ even offers short, handy names for these types: string, u16string, and u32string. basic_string already solves the problem you're asking about here by offering member functions for copying, comparing, and getting the length of the string that work for any code unit you template it with.
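As a rough illustration of that point, the sketch below uses only basic_string member functions in place of the three str* calls. The helper name same_contents is hypothetical and exists only for this example.

```cpp
#include <string>

// Hypothetical helper: the same template works for any code unit type
// (char for UTF-8, char16_t for UTF-16, char32_t for UTF-32).
template <typename CharT>
bool same_contents(const std::basic_string<CharT>& a,
                   const std::basic_string<CharT>& b) {
    std::basic_string<CharT> copy = a;   // copy construction instead of strcpy
    return copy.size() == b.size()       // size() instead of strlen
        && copy.compare(b) == 0;         // compare() instead of strcmp
}
```

Calling same_contents with two std::string values or two std::u16string values compiles to the same logic; no null terminator is consulted at any point.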

I can't think of any good reason for new code that's not interfacing with legacy code to use anything else as its canonical storage type for strings. Even if you do interface with legacy code that uses something else, if the surface area of that interface isn't large you should probably still use one of the standard types and not anything else, and of course if you're interfacing with legacy code you'll be using that legacy type anyway, not writing your own new type.

With that said, the reason you can't use strcmp, strcpy, and strlen for your templated string type is that they all operate on null terminated byte sequences. If your code unit is larger than one byte then there may be bytes that are zero before the actual terminating null code unit (assuming you use null termination at all, which you probably shouldn't). Consider the bytes of this UTF-16 representation of the string "Hello" (on a little endian machine).

48 00 65 00 6c 00 6c 00  6f 00

Since UTF-16 uses 16-bit code units, the character 'H' ends up stored as the two bytes 48 00. A function that treats the first zero byte as the end of the sequence would take the second half of the first character as the end of the whole string. This clearly will not work.
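To make the failure concrete, here is a small sketch that measures the same UTF-16 data both ways. The names kHello, byte_oriented_length and code_unit_length are made up for this example, and the byte-oriented result depends on the machine's endianness as noted in the comments.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// UTF-16 "Hello": five code units followed by a terminating null code unit.
const char16_t kHello[] = u"Hello";

// What a byte-oriented function sees: it stops at the first zero byte.
// On a little-endian machine that is the second byte of 'H', giving 1;
// on a big-endian machine the very first byte is zero, giving 0.
std::size_t byte_oriented_length() {
    return std::strlen(reinterpret_cast<const char*>(kHello));
}

// What a code-unit-aware function sees: it stops at the null code unit.
std::size_t code_unit_length() {
    return std::char_traits<char16_t>::length(kHello);
}
```

Either way the byte-oriented answer is far short of the five code units actually stored.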

So, strcmp, strcpy, and strlen are all specialized versions of algorithms that can be implemented more generally. Since they only work with byte sequences, and you need to work with code unit sequences where the code unit may be larger than a byte, you need generic algorithms that can work with any code unit. The standard library has lots of generic algorithms to offer you. Here are my suggestions for replacing these str* functions.

strcmp compares two sequences of code units and returns 0 if the two sequences are equal, a negative value if the first is lexicographically less than the second, and a positive value otherwise. The standard library contains the generic algorithm lexicographical_compare, which does nearly the same thing, except that it returns true if the first sequence is lexicographically less than the second and false otherwise.
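If a three-way result in the style of strcmp is wanted, one possible approach is to call lexicographical_compare twice, once in each direction. The helper name compare_code_units is hypothetical; this is a sketch, not the only way to do it.

```cpp
#include <algorithm>

// Hypothetical strcmp-like helper built from std::lexicographical_compare.
// Returns a negative value if the first range orders before the second,
// a positive value if it orders after, and zero if neither is less.
template <typename It1, typename It2>
int compare_code_units(It1 first1, It1 last1, It2 first2, It2 last2) {
    if (std::lexicographical_compare(first1, last1, first2, last2)) return -1;
    if (std::lexicographical_compare(first2, last2, first1, last1)) return 1;
    return 0;
}
```

This walks the ranges up to twice, which is fine for illustration; a single pass with std::mismatch would avoid the second traversal.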

strcpy copies a sequence of code units. You can use the standard library's copy algorithm instead.
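A minimal sketch of that replacement, assuming the caller already knows how many code units to copy and has sized the destination accordingly. The name copy_code_units is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical strcpy replacement: copies `count` code units from `src` to `dst`.
// Assumes `dst` has room for `count` elements and does not point into the
// source range. Returns a pointer one past the last element written.
template <typename CharT>
CharT* copy_code_units(const CharT* src, std::size_t count, CharT* dst) {
    return std::copy(src, src + count, dst);
}
```

Because the length is passed explicitly, the copy is correct even when the data contains zero-valued bytes or code units.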

strlen takes a pointer to a code unit and counts the number of code units before it finds a null value. If you need this function as opposed to one that just tells you the number of code units in the string, you can implement it with the algorithm find by passing the null value as the value to be found. If instead you want to find the actual length of the sequence, your class should just offer a size method that directly accesses whatever method your class uses internally to store the size.
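One caveat when sketching this: std::find needs an end iterator, so the version below assumes the buffer's capacity is known, which makes it closer to strnlen than strlen. The name bounded_length is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical strnlen-like helper: counts code units before the first null
// code unit, examining at most `capacity` elements of `buf`.
// Returns `capacity` if no null code unit is found in that range.
template <typename CharT>
std::size_t bounded_length(const CharT* buf, std::size_t capacity) {
    const CharT* end = buf + capacity;
    const CharT* nul = std::find(buf, end, CharT());  // CharT() is the null code unit
    return static_cast<std::size_t>(nul - buf);
}
```

For a truly unbounded null-terminated sequence, std::char_traits<CharT>::length does the same counting without needing a capacity.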

Unlike the str* functions, the algorithms I've suggested take two iterators to demarcate a code unit sequence: one pointing to the first element in the sequence, and one pointing to the position after the final element. The str* functions take only a pointer to the first element and assume the sequence continues until the first zero-valued code unit they find. When you're implementing your own templated string class, it's best to move away from the explicit null termination convention as well, and just offer an end() method that provides the correct end point for your string.
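Putting those pieces together, a class along these lines might look roughly like the following. This is only an illustrative sketch: the std::vector storage, the iterator-pair constructor and the member set are all assumptions, not a recommended design.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal illustrative sketch, not a complete string class.
// Storage in a std::vector is an assumption made for brevity.
template <typename CodeUnit>
class String {
public:
    String() = default;

    // Builds the string from an iterator pair; no terminator is needed.
    template <typename It>
    String(It first, It last) : units_(first, last) {}

    const CodeUnit* begin() const { return units_.data(); }
    const CodeUnit* end() const { return units_.data() + units_.size(); }
    std::size_t size() const { return units_.size(); }

    // Equality over the stored range; embedded zero code units are fine.
    friend bool operator==(const String& a, const String& b) {
        return a.size() == b.size() &&
               std::equal(a.begin(), a.end(), b.begin());
    }

private:
    std::vector<CodeUnit> units_;
};
```

With begin() and end() available, the generic algorithms mentioned above can be applied to a String<unsigned char> or a String<unsigned short> without any special casing.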

David Schwartz · Answer

The reason you can't use strcmp, strcpy, or strlen is that they operate on strings whose length is indicated by a terminating zero byte. Since your strings may contain zero bytes inside them, you can't use these functions.

I would just code exactly what you want. What you want depends on what you're trying to do.

Tags: c++, string, utf-8, utf-16, implementation

mmurphy