I'm trying to convert a UTF-8 string to an ISO-8859-1 char* for use in legacy code. The only way I can see to do this is with iconv. I would definitely prefer a completely string-based C++ solution, and then just call .c_str() on the resulting string. How do I do this? A code example would be appreciated, if possible. I'm fine using iconv if it is the only solution you know.


Convert string from UTF-8 to ISO-8859-1

2 Answers

I'm going to modify my code from another answer to implement the suggestion from Alf.

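#include <string>

// Decodes UTF-8 and keeps only the code points that fit in ISO-8859-1 (0..255).
std::string UTF8toISO8859_1(const char * in)
{
    std::string out;
    if (in == NULL)
        return out;

    // Initialized so that a stray leading continuation byte never reads an
    // indeterminate value.
    unsigned int codepoint = 0;
    while (*in != 0)
    {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;                              // single byte (ASCII)
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);  // continuation byte: append 6 bits
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;                       // lead byte of a 2-byte sequence
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;                       // lead byte of a 3-byte sequence
        else
            codepoint = ch & 0x07;                       // lead byte of a 4-byte sequence
        ++in;
        // The sequence is complete once the next byte is not a continuation byte.
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff))
        {
            if (codepoint <= 255)
            {
                out.append(1, static_cast<char>(codepoint));
            }
            else
            {
                // do whatever you want for out-of-bounds characters
            }
        }
    }
    return out;
}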

Invalid UTF-8 input results in dropped characters.

answered Sep 21 '22 20:09

Mark Ransom

First convert UTF-8 to 32-bit Unicode.

Then keep the values that are in the range 0 through 255.

Those are the Latin-1 code points, and for other values, decide if you want to treat that as an error or perhaps replace with code point 127 (my fav, the ASCII "del") or question mark or something.
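
The second step described here (keep code points 0 through 255, substitute something else for the rest) might look like the following sketch. The function name utf32ToLatin1 and the default '?' replacement are assumptions made for illustration and are not part of the answer; pass '\x7f' instead if you prefer the ASCII "del".

```cpp
#include <string>

// Hypothetical helper: narrows UTF-32 code points to ISO-8859-1 bytes.
// Code points above 0xFF have no Latin-1 equivalent and are replaced.
std::string utf32ToLatin1(const std::u32string& in, char replacement = '?')
{
    std::string out;
    out.reserve(in.size());  // exactly one output byte per input code point
    for (char32_t cp : in)
    {
        if (cp <= 0xFF)
            out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
        else
            out.push_back(replacement);
    }
    return out;
}
```

The result can be handed to legacy code with .c_str(), as long as the returned std::string is kept alive for as long as the pointer is in use.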

The C++ standard library defines a std::codecvt specialization that can be used,

template<>
codecvt<char32_t, char, mbstate_t>

C++11 §22.4.1.4/3: “the specialization codecvt <char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes”
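
As a rough sketch of how that specialization could be driven directly, the following obtains the facet from the classic locale and calls its in() member to decode UTF-8 into UTF-32. The helper name utf8ToUtf32 and the choice to throw on malformed or truncated input are assumptions. Note also that this specialization was deprecated in C++20, so newer compilers may emit a warning.

```cpp
#include <cwchar>     // std::mbstate_t
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 into UTF-32 using the standard facet.
std::u32string utf8ToUtf32(const std::string& utf8)
{
    if (utf8.empty())
        return std::u32string();

    typedef std::codecvt<char32_t, char, std::mbstate_t> Facet;
    const Facet& facet = std::use_facet<Facet>(std::locale::classic());

    // A UTF-32 result never has more code units than the UTF-8 input has bytes.
    std::u32string out(utf8.size(), U'\0');
    std::mbstate_t state = std::mbstate_t();
    const char* fromNext = nullptr;
    char32_t* toNext = nullptr;

    std::codecvt_base::result r = facet.in(
        state,
        utf8.data(), utf8.data() + utf8.size(), fromNext,
        &out[0], &out[0] + out.size(), toNext);

    // Assumption: treat anything other than a complete conversion as an error.
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("invalid or truncated UTF-8 input");

    out.resize(static_cast<std::size_t>(toNext - &out[0]));
    return out;
}
```

The resulting code points can then be filtered down to the 0 through 255 range as described above.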

answered Sep 21 '22 20:09

Cheers and hth. - Alf


Tags:

c++

utf-8

iso-8859-1

iconv

Chris Redford
