Lua Unicode: using string.sub() with two-byte characters

Question

For example, I want to remove the first 2 letters from the strings "ПРИВЕТ" and "HELLO.". One of them contains only two-byte Unicode characters.

I tried string.sub("ПРИВЕТ", 3) and string.sub("HELLO.", 3).

I got "РИВЕТ" and "LLO.".

string.sub() removed 2 bytes (not characters) from each string, so only one letter was dropped from the Cyrillic string. How can I remove characters instead of bytes?

Something like utf8.sub().

lhf · Accepted Answer

The key standard function for this task is utf8.offset(s,n), which gives the position in bytes of the start of the n-th character of s.

So try this:

print(string.sub(s,utf8.offset(s,3),-1))
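
For the two strings from the question, the call above can be checked with a short sketch like the one below. It assumes Lua 5.3 or later (where the utf8 library exists) and a source file saved as UTF-8; the variable names are only illustrative.

```lua
-- Assumes Lua 5.3+ and that this file is saved as UTF-8.
local cyr = "ПРИВЕТ"   -- 6 characters, 2 bytes each
local lat = "HELLO."   -- 6 characters, 1 byte each

-- #s counts bytes, utf8.len counts characters.
print(#cyr, utf8.len(cyr))   -- 12 bytes, 6 characters
print(#lat, utf8.len(lat))   -- 6 bytes, 6 characters

-- Byte position where the 3rd character starts.
print(utf8.offset(cyr, 3))   -- 5
print(utf8.offset(lat, 3))   -- 3

-- Dropping the first two characters works the same way for both strings.
print(string.sub(cyr, utf8.offset(cyr, 3), -1))   -- ИВЕТ
print(string.sub(lat, utf8.offset(lat, 3), -1))   -- LLO.
```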

You can define utf8.sub as follows:

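function utf8.sub(s,i,j)
    -- byte position where the i-th character starts
    i=utf8.offset(s,i)
    -- byte position where the j-th character ends:
    -- one byte before the start of character j+1
    j=utf8.offset(s,j+1)-1
    return string.sub(s,i,j)
end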

(This code only works for positive j. See http://lua-users.org/lists/lua-l/2014-04/msg00590.html for the general case.)
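
One possible way to cover negative and out-of-range indices is sketched below. This is not necessarily the approach taken in the linked post; it is a variant that assumes Lua 5.3 or later, converts negative indices to positive ones with utf8.len, and clamps them the way string.sub does. Returning nil for invalid UTF-8 and defaulting j to -1 are choices made for this sketch.

```lua
-- A sketch of utf8.sub that accepts negative and out-of-range indices.
-- Assumes Lua 5.3+; s is expected to be valid UTF-8.
function utf8.sub(s, i, j)
    local n = utf8.len(s)          -- character count, nil if s is not valid UTF-8
    if not n then return nil end   -- assumption: signal invalid input with nil
    j = j or -1                    -- assumption: default to the last character

    -- Translate negative indices (counted from the end) to positive ones.
    if i < 0 then i = n + i + 1 end
    if j < 0 then j = n + j + 1 end

    -- Clamp to the valid character range, as string.sub does for bytes.
    if i < 1 then i = 1 end
    if j > n then j = n end
    if i > j then return "" end

    -- Convert character indices to byte positions.
    local first = utf8.offset(s, i)
    local last = utf8.offset(s, j + 1) - 1   -- offset of character n+1 is #s+1
    return string.sub(s, first, last)
end

print(utf8.sub("ПРИВЕТ", 3))       -- ИВЕТ
print(utf8.sub("ПРИВЕТ", -2))      -- ЕТ
print(utf8.sub("ПРИВЕТ", 2, -2))   -- РИВЕ
print(utf8.sub("HELLO.", 3))       -- LLO.
```

Calling utf8.len first costs an extra pass over the string, but it keeps the index handling simple and guarantees that both utf8.offset calls receive positions that exist.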
