Detect if last character is not multibyte in Lua

First question. What's the easiest way in Lua to determine whether the last character in a string is not multibyte? Or, alternatively, what's the easiest way to delete the last character from a string?

Here are examples of valid strings, and what I want the output of the function to be:

hello there     --- result should be:   hello ther
anñ             --- result should be:   an
כראע            --- result should be:   כרא
ㅎㄹㅇㅇㅅ       --- result should be:   ㅎㄹㅇㅇ

I need something like

function lastCharacter(string)
    --- some code which will extract the last character only ---
    return lastChar
end

or if it's easier

function deleteLastCharacter(string)
    --- some code which will output the string minus the last character ---
    return newString
end

This is the path I was going on

local function lastChar(str)
    local stringLength = string.len(str)
    local lastc = string.sub(str, stringLength, stringLength)
    -- one possible test: a final byte above 127 belongs to a multibyte character
    if string.find(lastc, "[\128-\255]") then
        local wordTable = {}
        -- collect every run matched by the pattern and keep the last one
        for word in string.gmatch(str, "[\33-\127\192-\255]+[\128-\191]*") do
            wordTable[#wordTable+1] = word
        end
        lastc = wordTable[#wordTable]
    end
    return lastc
end
fun_programming asked Dec 04 '22 11:12


1 Answer

First of all, note that none of the functions in Lua's string library know anything about Unicode or multibyte encodings (source: Programming in Lua, 3rd edition). As far as Lua is concerned, strings are simply sequences of bytes. If you are using UTF-8 encoded strings, it's up to you to figure out which bytes make up a character. Therefore, string.len gives you the number of bytes, not the number of characters, and string.sub gives you a substring of bytes, not a substring of characters.
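A quick way to see the byte-versus-character distinction for yourself. This is only a minimal sketch: the sample string is written with decimal escapes, so it does not depend on how your source file is encoded, and the variable names are made up for illustration.

```lua
-- "anñ" in UTF-8: 'a', 'n', then the two bytes 195 177 that encode 'ñ'
local s = "an\195\177"

local byteCount = string.len(s)      -- counts bytes, not characters
local lastByte  = string.sub(s, -1)  -- only the second byte of 'ñ'

print(byteCount)               -- 4
print(string.byte(lastByte))   -- 177
```

So taking the last byte with string.sub cuts the 'ñ' in half, which is exactly the problem in the question.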

Some UTF-8 basics:

If you need some refreshing on the conceptual basics of Unicode, you should check out this article.

UTF-8 is one possible (and very important) encoding of Unicode, and probably the one you are dealing with. Unlike UTF-32, which always uses 4 bytes per character, it uses a variable number of bytes (from 1 to 4) to encode each character. In particular, the ASCII characters 0 to 127 are represented with a single byte, so that ASCII strings can be correctly interpreted as UTF-8 (and vice versa, if you only use those 128 characters). All other characters start with a byte in the range from 194 to 244, which signals that more bytes follow to encode a full character. This range is further subdivided, so that you can tell from this byte whether 1, 2 or 3 more bytes follow. Those additional bytes are called continuation bytes and are guaranteed to be taken only from the range from 128 to 191. Therefore, by looking at a single byte we know where it stands in a character:

  • If it's in [0,127], it's a single-byte (ASCII) character
  • If it's in [128,191], it's part of a longer character and meaningless on its own
  • If it's in [194,244], it marks the beginning of a longer character (and tells us how long that character is)
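The byte ranges can be turned into a small lookup function. Treat this as a sketch: byteRole is a hypothetical helper name, not something from Lua's libraries, and the sub-ranges for 2-, 3- and 4-byte characters are the ones defined by UTF-8 itself (lead bytes 194 to 244, as stated above).

```lua
-- Hypothetical helper: describe the role of a single byte value (0-255)
-- in a UTF-8 string. Returns a role name and, for bytes that start a
-- character, the total length of that character in bytes.
local function byteRole(b)
    if b <= 127 then
        return "ascii", 1            -- complete single-byte character
    elseif b <= 191 then
        return "continuation", nil   -- meaningless on its own
    elseif b >= 194 and b <= 223 then
        return "lead", 2             -- one continuation byte follows
    elseif b >= 224 and b <= 239 then
        return "lead", 3             -- two continuation bytes follow
    elseif b >= 240 and b <= 244 then
        return "lead", 4             -- three continuation bytes follow
    end
    return "invalid", nil            -- 192, 193 and 245-255 never occur in valid UTF-8
end

print(byteRole(string.byte("a")))   -- returns "ascii", 1
print(byteRole(195))                -- returns "lead", 2
print(byteRole(177))                -- returns "continuation", nil
```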

This information is enough to count characters, split a UTF-8 string into characters and do all sorts of other UTF-8-sensitive manipulations.
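For instance, counting and splitting could look roughly like this. It is a sketch that assumes the input is valid UTF-8; utf8Length and utf8Chars are made-up names, and the patterns rely on the constructs explained in the next section.

```lua
-- Count characters: every character contributes exactly one byte that is
-- not a continuation byte, so counting those bytes counts characters.
local function utf8Length(s)
    local _, count = string.gsub(s, "[^\128-\191]", "")
    return count
end

-- Split into characters: one non-continuation byte plus the continuation
-- bytes that follow it.
local function utf8Chars(s)
    local chars = {}
    for ch in string.gmatch(s, "[^\128-\191][\128-\191]*") do
        chars[#chars + 1] = ch
    end
    return chars
end

local sample = "an\195\177"       -- "anñ" written with decimal escapes
print(string.len(sample))         -- 4 (bytes)
print(utf8Length(sample))         -- 3 (characters)
print(#utf8Chars(sample))         -- 3
```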

Some pattern matching basics:

For the task at hand we need a few of Lua's pattern matching constructs:

[...] is a character class, that matches a single character (or rather byte) of those inside the class. E.g. [abc] matches either a, or b or c. You can define ranges using a hyphen. Therefore [\33-\127] for example, matches any single one of the bytes from 33 to 127. Note that \127 is an escape sequence you can use in any Lua string (not just patterns) to specify a byte by its numerical value instead of the corresponding ASCII character. For instance, "a" is the same as "\97".

You can negate a character class by starting it with ^ (so that it matches any single byte that is not part of the class).

* repeats the previous token 0 or more times (it is greedy: it matches as many repetitions as possible).

$ is an anchor. If it's the last character of the pattern, the pattern will only match at the end of the string.
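If you want to try these constructs in isolation, here is a small sketch on plain ASCII strings (the sample strings are arbitrary):

```lua
-- A class matches one byte; string.match returns the first match it finds.
print(string.match("xyzb", "[abc]"))      -- b
-- "\97" is just another way of writing "a".
print("\97" == "a")                       -- true
-- A negated class matches one byte that is NOT listed.
print(string.match("aab!", "[^ab]"))      -- !
-- * repeats the previous class as often as possible (possibly zero times).
print(string.match("xaaab", "x[ab]*"))    -- xaaab
-- $ forces the match to end at the end of the string.
print(string.match("abcabc", "[bc]*$"))   -- bc
```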

Combining all of that...

...your problem reduces to a one-liner:

local function lastChar(s)
    return string.match(s, "[^\128-\191][\128-\191]*$")
end

This first matches a byte that is not a UTF-8 continuation byte (i.e., either a single-byte character or the byte that marks the beginning of a longer character). Then it matches an arbitrary number of continuation bytes (this cannot run past the current character, because a continuation byte can never start a new character), followed by the end of the string ($). Therefore, this gives you all the bytes that make up the last character in the string. It produces the desired output for all 4 of your examples.

Equivalently, you can use gsub to remove that last character from your string:

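function deleteLastCharacter(s)
    -- string.gsub also returns the number of replacements; the extra
    -- parentheses keep only the first result (the modified string)
    return (string.gsub(s, "[^\128-\191][\128-\191]*$", ""))
end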

The match is the same, but instead of returning the matched substring, we replace it with "" (i.e. remove it) and return the modified string.
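Here is a sketch you can run to check both functions. The function names are reused from above for illustration, LAST_CHAR is a made-up constant, and the non-ASCII test string is written with decimal escapes so the result does not depend on your editor's encoding. One detail worth knowing: string.gsub returns the number of replacements as a second value, so the sketch wraps the call in parentheses to return the string only.

```lua
-- Same pattern as in the answer, stored once for both functions.
local LAST_CHAR = "[^\128-\191][\128-\191]*$"

local function lastChar(s)
    return string.match(s, LAST_CHAR)
end

local function deleteLastCharacter(s)
    -- parentheses keep only the first result of gsub (the new string)
    return (string.gsub(s, LAST_CHAR, ""))
end

print(lastChar("hello there"))                 -- e
print(deleteLastCharacter("hello there"))      -- hello ther
print(deleteLastCharacter("an\195\177"))       -- an   (input is "anñ")
print(#lastChar("an\195\177"))                 -- 2    (both bytes of 'ñ')
print(lastChar(""))                            -- nil  (nothing to match)
```

Both functions assume valid UTF-8 input; on an empty string lastChar returns nil and deleteLastCharacter returns the empty string unchanged.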

Martin Ender answered Dec 20 '22 09:12