
Why does POSIX specify wctomb as non-thread-safe, but not mbtowc?

In XSH 2.9.1, wctomb is listed as one of the functions which is not required to be thread-safe. However, the opposite conversion function, mbtowc, does not appear in the list. On an implementation with encodings that use shift states, neither has a thread-safe API, and it makes no sense that one is required to be thread-safe and the other is not, while neither can be thread-safe without forbidding stateful encodings.

Likewise for wcstombs (which is in the list) and mbstowcs (which is not). Since both of these functions operate on entire strings which begin and end in the initial shift state, they are not stateful, their APIs are thread-safe, and again it makes no sense that one direction is specified to be thread-safe but not the other.

Can anyone shed some light on this?

R.. GitHub STOP HELPING ICE asked Jan 24 '11 17:01



3 Answers

As you noted in the question, wctomb has, or at least is allowed to have, a (hidden) "shift state": see http://pubs.opengroup.org/onlinepubs/009695399/functions/wctomb.html and compare this to wcrtomb: http://pubs.opengroup.org/onlinepubs/009695399/functions/wcrtomb.html which has an explicit "state" pointer.

POSIX basically allows the implementation to provide wctomb as a call to wcrtomb, using a static variable to hold the shift state. If that variable is not a per-thread item, wctomb won't be thread-safe.

(all of this is pretty obvious and is contained in your question, I'm just repeating it for clarity here)
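The layering described above can be sketched roughly as follows. This is only an illustration, not how any particular libc does it: my_wctomb and hidden_state are hypothetical names, and the null-pointer query simply reports "stateless" because the sketch has no way to inspect the locale's encoding.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical: one hidden conversion state shared by every thread.
   Static storage is zero-initialized, which is the initial state. */
static mbstate_t hidden_state;

/* Hypothetical wctomb-like wrapper layered on wcrtomb. */
int my_wctomb(char *s, wchar_t wc)
{
    if (s == NULL) {
        /* Query form: a real implementation reports whether the current
           encoding is state-dependent; this sketch just answers "no". */
        return 0;
    }
    /* Reads and updates hidden_state without any locking, so two threads
       calling this at once can interleave their updates. */
    size_t n = wcrtomb(s, wc, &hidden_state);
    if (n == (size_t)-1)
        return -1;
    assert(n <= MB_LEN_MAX); /* so the cast to int below is safe */
    return (int)n;
}
```

Declaring hidden_state with thread-local storage would remove the sharing, which is the "per-thread item" case mentioned above.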

Note that no argument to wctomb gives you any sort of explicit control of the hidden shift state (if there is one). Specifically, you can't reset it to the start state.

But now look at mbtowc: http://pubs.opengroup.org/onlinepubs/009695399/functions/mbtowc.html where the text says:

For a state-dependent encoding, this function is placed into its initial state by a call for which its character pointer argument, s, is a null pointer. Subsequent calls with s as other than a null pointer shall cause the internal state of the function to be altered as necessary. A call with s as a null pointer shall cause this function to return a non-zero value if encodings have state dependency, and 0 otherwise. If the implementation employs special bytes to change the shift state, these bytes shall not produce separate wide-character codes, but shall be grouped with an adjacent character.

That is, they give you an explicit way to detect and control the hidden state (if it exists). So even if it exists and is not thread-specific, you can "do your own control" as it were.
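As a minimal sketch of that "do your own control" idea, a caller can reset and query the hidden state before decoding. The helper name decode_first_char and its stateful_out parameter are made up for illustration. Note that the reset and the decode are two separate calls, so this alone does not stop another thread from touching the hidden state in between.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: decode the first multibyte character of s,
   starting from the initial shift state. Returns what mbtowc returns. */
int decode_first_char(const char *s, size_t n, wchar_t *out, int *stateful_out)
{
    assert(s != NULL && out != NULL);

    /* A null s resets the hidden state; the return value is non-zero
       if the current encoding is state-dependent, 0 otherwise. */
    int stateful = mbtowc(NULL, NULL, 0);
    if (stateful_out != NULL)
        *stateful_out = stateful;

    /* Decode at most n bytes of s into *out. */
    return mbtowc(out, s, n);
}
```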

While I can't swear to it, I think this is why mbtowc is not listed as being non-thread-safe.

(This is not the way I would have written the text in the standard, though! Then again, if I had my way, a lot of these functions would not even exist. :-) )

torek answered Nov 04 '22 02:11



The fundamental asymmetry is that ISO C requires that wide characters have fixed width (same for all characters), and that the encoding does not have shift states. In contrast, multibyte encoding is dependent on the locale and may have varying character width and also shift states.

All four functions keep internal state between calls (mbstowcs and wcstombs have to as well, because they convert only a specified number of bytes, as opposed to full strings that would otherwise finish in the initial shift state).

There is a subtle difference of what the internal state consists of in case of the string conversions. For mbstowcs, a whole number of wide characters is converted in a single call. This is because wide characters have a fixed width, and also because the n parameter of the call is specified in characters, not bytes. In contrast, for wcstombs the n parameter is specified in bytes, not in multi-byte characters. Consequently, the state kept for wcstombs has to include not only the shift state, but also the remainder of a partially output multibyte character. Because the state is thus multi-part, operations (load and store) on it are not going to be atomic on a typical architecture without additional locking.

At this point it is important to remind ourselves that "thread safety" has a fairly technical meaning in POSIX, namely that parallel invocations are logically serializable. It does not mean that parallel usage is necessarily very useful. As all four functions keep internal state, it is hard to imagine a caller that processes a single linear string (at a time) left to right while spreading the calls across multiple threads. This is witnessed by the introduction of mbrtowc, wcrtomb, mbsrtowcs and wcsrtombs in Amendment 1 to ISO C 89/90, with the r standing for "restartable": the conversion state is passed explicitly by the caller instead of being hidden inside the function.
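For comparison, here is a minimal sketch of the explicit-state style those Amendment 1 functions allow. The helper name convert_with_own_state is hypothetical. Each call owns its mbstate_t, so no conversion state is shared between threads.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a wide string to multibyte form using a
   conversion state that lives on the caller's stack. Returns the number
   of bytes written (excluding the terminating null), or (size_t)-1. */
size_t convert_with_own_state(char *dst, const wchar_t *src, size_t dst_bytes)
{
    assert(dst != NULL && src != NULL);

    mbstate_t st;
    memset(&st, 0, sizeof st); /* all-zero is the initial conversion state */

    const wchar_t *p = src;    /* wcsrtombs advances this pointer */
    return wcsrtombs(dst, &p, dst_bytes, &st);
}
```

The result still depends on the current locale, so this only removes the hidden state, not the dependence on the locale setting.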

I cannot explain exactly why an "atomically accessible" internal state should make the respective call easier to implement in a thread-safe way (because a single call sometimes needs multiple accesses, load and store), but maybe it is because the burden of additional locking (and reloading) is imposed only in the rarely visited code branch where an actual shift-state change takes place.

There is also another catch to explain. A concurrent thread may call setlocale and change the character-encoding category (LC_CTYPE) of the locale. The ISO C standard specifies that such an action causes the current state (and, by the way, even a state properly captured using wcrtomb) to become undefined. This is because the shift states of different locales may not map to each other in useful or specified ways. Although this is a threaded scenario that has the potential to break even the "reentrant" family of functions, it does not necessarily pose an obstacle to a formally thread-safe implementation, because the locale setting can be cached throughout each call.

Jirka Hanika answered Nov 04 '22 02:11



I think this is simply based on the assumption that with a set of wide character encoded data, the programmer can predict how much memory needs to be allocated/freed per thread because of the fixed width of the code point. But when going in the other direction, depending on the encoding, it might not be "predictable" how much memory will need to be allocated in advance, thus creating greater room for error.

Update: after having found an older version of the standard, I noticed that there is a difference in the wording in the man page for the wctomb() function: "The wctomb() function need not be reentrant. A function that is not required to be reentrant is not required to be thread-safe." I think this suggests another implicit assumption made in the standard: mbtowc() is or should be reentrant...

Zack answered Nov 04 '22 02:11
