How to remove the last character of a UTF-8 string in C++?

The text is stored in a std::string.

If the text is 8-bit ASCII, then it is really easy:

text.pop_back();

But what if it is UTF-8 text?
As far as I know, there are no UTF-8 related functions in the standard library which I could use.

Iter Ator asked Jun 03 '16 21:06



2 Answers

You really need a UTF-8 library if you are going to do serious work with UTF-8. However, for this one task I think something like this may suffice:

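#include <iostream>
#include <string>

// Remove the last UTF-8 encoded code point from the string.
void pop_back_utf8(std::string& utf8)
{
    if(utf8.empty())
        return;

    // Step back over continuation bytes (10xxxxxx) until the start byte.
    // Using an index avoids forming a pointer before the start of the buffer.
    auto pos = utf8.size() - 1;
    while(pos > 0 && (utf8[pos] & 0b11000000) == 0b10000000)
        --pos;
    utf8.resize(pos);
}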

int main()
{
    std::string s = "κόσμε";

    while(!s.empty())
    {
        std::cout << s << '\n';
        pop_back_utf8(s);
    }
}

Output:

κόσμε
κόσμ
κόσ
κό
κ

It relies on the fact that every code point in UTF-8 is encoded as one start byte followed by zero to three continuation bytes. Continuation bytes always have the bit pattern 10xxxxxx, which is what the bitwise test in the loop detects.
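
To make the byte layout concrete, here is a minimal sketch that uses the same bit test to count code points: every byte that is not of the form 10xxxxxx starts a new code point. The names is_continuation_byte and count_code_points are hypothetical and not part of the answer above, and the sketch assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true for bytes of the form 10xxxxxx.
bool is_continuation_byte(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: counts code points by counting start bytes.
// Assumes the input is valid UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for(unsigned char b : utf8)
        if(!is_continuation_byte(b))
            ++n;
    return n;
}
```

For "κόσμε", which occupies 10 bytes when each letter is a single precomposed code point, count_code_points returns 5.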

Galik answered Oct 11 '22 05:10


What you can do is pop off bytes until you reach the leading byte of a code point, and then pop that byte too. The leading byte of a code point in UTF-8 has either the pattern 0xxxxxxx or 11xxxxxx, and all non-leading bytes have the form 10xxxxxx. This means you can check the two most significant bits to determine whether you have a leading byte.

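bool is_leading_utf8_byte(char c) {
    auto first_bit_set = (c & 0x80) != 0;
    auto second_bit_set = (c & 0x40) != 0;
    return !first_bit_set || second_bit_set;
}

void pop_utf8(std::string& x) {
    // Checking empty() first avoids calling back() on an empty string,
    // which is undefined behavior.
    while (!x.empty() && !is_leading_utf8_byte(x.back()))
        x.pop_back();
    if (!x.empty())
        x.pop_back();
}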

This of course does no validation and assumes that your string is valid UTF-8.
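
If the input cannot be trusted, one possible approach is to check that the trailing bytes form a structurally complete sequence before removing them. The following is only a sketch under that assumption: the names is_continuation, sequence_length and pop_utf8_checked are hypothetical, and the check covers only the lead byte and the number of continuation bytes. It does not reject overlong encodings, surrogates, or code points above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true for bytes of the form 10xxxxxx.
static bool is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: total sequence length implied by a lead byte,
// or 0 if the byte cannot start a sequence.
static std::size_t sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;           // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;
}

// Removes the last code point and returns true, or leaves the string
// untouched and returns false if it is empty or its tail is malformed.
bool pop_utf8_checked(std::string& s)
{
    if (s.empty())
        return false;

    std::size_t i = s.size() - 1;
    std::size_t trailing = 0; // continuation bytes seen so far (at most 3)
    while (i > 0 && trailing < 3 &&
           is_continuation(static_cast<unsigned char>(s[i]))) {
        --i;
        ++trailing;
    }

    // The lead byte must announce exactly the number of bytes we found.
    if (sequence_length(static_cast<unsigned char>(s[i])) != trailing + 1)
        return false;

    s.resize(i);
    return true;
}
```

Returning a bool lets the caller decide what to do with malformed input instead of silently truncating it.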

Benjamin Lindley answered Oct 11 '22 04:10