
What is the purpose of the s==NULL case for mbrtowc?

mbrtowc is specified to handle a NULL pointer for the s (multibyte character pointer) argument as follows:

If s is a null pointer, the mbrtowc() function shall be equivalent to the call:

mbrtowc(NULL, "", 1, ps)

In this case, the values of the arguments pwc and n are ignored.

As far as I can tell, this usage is largely useless. If ps is not storing any partially-converted character, the call will simply return 0 with no side effects. If ps is storing a partially-converted character, then since '\0' is not valid as the next byte in a multibyte sequence ('\0' can only be a string terminator), the call will return (size_t)-1 with errno==EILSEQ and leave ps in an undefined state.

The intended usage seems to have been to reset the state variable, particularly when NULL is passed for ps and the internal state has been used, analogous to mbtowc's behavior with stateful encodings. However, this is not specified anywhere as far as I can tell, and it conflicts with the semantics for mbrtowc's storage of partially-converted characters: if mbrtowc were to reset state when encountering a 0 byte after a potentially-valid initial subsequence, it would be unable to detect this dangerous invalid sequence.

If mbrtowc were specified to reset the state variable only when s is NULL, but not when it points to a 0 byte, a desirable state-reset behavior would be possible, but such behavior would violate the standard as written. Is this a defect in the standard? As far as I can tell, there is absolutely no way to reset the internal state (used when ps is NULL) once an illegal sequence has been encountered, and thus no correct program can use mbrtowc with ps==NULL.

R.. GitHub STOP HELPING ICE asked Jan 17 '11 02:01


1 Answer

Since a '\0' byte must convert to a null wide character regardless of shift state (5.2.1.2 Multibyte characters), and the mbrtowc() function is specified to reset the shift state when it converts to a wide null character (7.24.6.3.2/3 The mbrtowc function), calling mbrtowc(NULL, "", 1, ps) will reset the shift state stored in the mbstate_t pointed to by ps. And if mbrtowc(NULL, "", 1, NULL) is called to use the library's internal mbstate_t object, it will be reset to an initial state. See the end of the answer for cites of the relevant bits of the standard.
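As a minimal sketch of the reset call described above, the following applies it to a caller-owned mbstate_t and then checks the result with mbsinit(). The helper name reset_state is hypothetical and not part of any library. The sketch only shows the uncontroversial case where the state holds no partial character; what happens when it does hold one is exactly what the question disputes, so the helper reports failure rather than assuming a reset took place.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: issue the s == NULL form of mbrtowc() on *ps.
   Returns 1 if the call returned 0 and *ps is then in the initial
   conversion state, 0 otherwise (for example if the library reports
   an error because *ps held a partial character). */
int reset_state(mbstate_t *ps)
{
    assert(ps != NULL);

    /* Specified as equivalent to mbrtowc(NULL, "", 1, ps). */
    size_t r = mbrtowc(NULL, NULL, 0, ps);

    return r == 0 && mbsinit(ps) != 0;
}

/* For a state object you own, zero-filling it is another way to get
   an initial conversion state, independent of what it held before. */
void reset_state_by_zeroing(mbstate_t *ps)
{
    assert(ps != NULL);
    memset(ps, 0, sizeof *ps);
}
```

Zero-filling is not available for the library's internal state used when ps is NULL, which is why the question focuses on that case.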

I'm by no means particularly experienced with the C standard multibyte conversion functions (my experience with this kind of thing has been using the Win32 APIs for conversion).

If mbrtowc() processes an 'incomplete char' that's cut short by a 0 byte, it should return (size_t)(-1) to indicate an invalid multibyte char (and thus detect the dangerous situation you describe). In that case the conversion/shift state is unspecified (and I think you're basically hosed for that string). The multibyte 'sequence' that a conversion was attempted on but contains a '\0' is invalid and never will be valid with subsequent data. If the '\0' wasn't intended to be part of the converted sequence, then it shouldn't have been included in the count of bytes available for processing.

If you're in a situation where you might get additional, subsequent bytes for a partial multibyte char (say from a network stream), the n you passed for the partial multibyte char shouldn't include a 0 byte, so you'll get a (size_t)(-2) returned. In this case, if you pass a '\0' while in the middle of the partial conversion, you'll lose the fact that there's an error and as a side-effect reset the mbstate_t state in use (whether it's your own or the internal one being used because you passed in a NULL pointer for ps). I think I'm essentially restating your question here.
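To make the (size_t)(-2) case concrete, here is a rough sketch that feeds one multibyte character to mbrtowc() a byte at a time, the way bytes might arrive from a stream. The function name feed_bytewise and the list of UTF-8 locale names are assumptions for illustration; which locale names exist varies by system, so the function returns -1 when none can be selected.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: feed the len bytes of ONE multibyte character
   to mbrtowc() one byte per call. Returns 1 if every byte but the last
   yields (size_t)-2 and the last yields 1, 0 if not, and -1 if no
   UTF-8 locale could be selected. */
int feed_bytewise(const char *bytes, size_t len)
{
    /* Assumed locale names; adjust for the target system. */
    static const char *const names[] = { "C.UTF-8", "en_US.UTF-8", "en_US.utf8" };
    char saved[256] = "C";
    int have_utf8 = 0, ok = 1;

    assert(bytes != NULL && len > 0);

    /* Remember the current LC_CTYPE locale so it can be restored. */
    const char *cur = setlocale(LC_CTYPE, NULL);
    if (cur != NULL && strlen(cur) < sizeof saved)
        strcpy(saved, cur);

    for (size_t i = 0; i < sizeof names / sizeof names[0] && !have_utf8; i++)
        have_utf8 = setlocale(LC_CTYPE, names[i]) != NULL;
    if (!have_utf8)
        return -1;

    mbstate_t st;
    memset(&st, 0, sizeof st);          /* initial conversion state */
    wchar_t wc = 0;

    for (size_t i = 0; i < len && ok; i++) {
        size_t r = mbrtowc(&wc, &bytes[i], 1, &st);
        /* Incomplete so far: (size_t)-2. Completed by this byte: 1. */
        ok = (i + 1 < len) ? (r == (size_t)-2) : (r == 1);
    }

    setlocale(LC_CTYPE, saved);
    return ok;
}
```

For example, feed_bytewise("\xC3\xA9", 2) passes the two UTF-8 bytes of U+00E9 separately; the partial character is carried in st between the calls.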

However I think it is possible to detect and handle this situation, but unfortunately it requires keeping track of some state yourself:

#include <stdio.h>      // EOF
#include <wchar.h>      // mbrtowc, mbstate_t, putwchar

#define MB_ERROR    ((size_t)(-1))
#define MB_PARTIAL  ((size_t)(-2))

// function to get a stream of multibyte characters from somewhere
int get_next(void);

int bar(void)
{
    int ch;     // int rather than char, so EOF can be told apart from data
    wchar_t wc;
    mbstate_t state = {0};

    int in_partial_convert = 0;

    while ((ch = get_next()) != EOF)
    {
        char c = (char)ch;
        size_t result = mbrtowc( &wc, &c, 1, &state);

        switch (result) {
        case MB_ERROR:
            // this multibyte char is invalid
            return -1;
        case MB_PARTIAL:
            // do nothing yet, we need more data
            // but remember that we're in this state
            in_partial_convert = 1;
            break;
        case 1:
            // output the completed wide char
            in_partial_convert = 0;     // no longer in the middle of a conversion
            putwchar(wc);
            break;
        case 0:
            if (in_partial_convert) {
                // this 'last' multibyte char was malformed
                // return an error condition
                return -1;
            }
            // end of the multibyte string
            // we'll handle similar to EOF
            return 0;
        }
    }

    return 0;
}

Maybe not an ideal situation, but I think it shows it's not completely broken so as to be impossible to use.


Standards citations:

5.2.1.2 Multibyte characters

  • A multibyte character set may have a state-dependent encoding, wherein each sequence of multibyte characters begins in an initial shift state and enters other locale-specific shift states when specific multibyte characters are encountered in the sequence. While in the initial shift state, all single-byte characters retain their usual interpretation and do not alter the shift state. The interpretation for subsequent bytes in the sequence is a function of the current shift state.

  • A byte with all bits zero shall be interpreted as a null character independent of shift state.

  • A byte with all bits zero shall not occur in the second or subsequent bytes of a multibyte character.

7.24.6.3.2/3 The mbrtowc function

If the corresponding wide character is the null wide character, the resulting state described is the initial conversion state.

Michael Burr answered Sep 20 '22 13:09