What's the easiest way in Lua to determine whether the last character in a string is multibyte? Alternatively, what's the easiest way to delete the last character from a string?
Here are examples of valid strings and what I want the output of the function to be:
hello there --- result should be: hello ther
anñ --- result should be: an
כראע --- result should be: כרא
ㅎㄹㅇㅇㅅ --- result should be: ㅎㄹㅇㅇ
I need something like
function lastCharacter(string)
    --- some code which will extract the last character only ---
    return lastChar
end
or if it's easier
function deleteLastCharacter(string)
    --- some code which will output the string minus the last character ---
    return newString
end
This is the path I was going on
local function lastChar(string)
    local stringLength = string.len(string)
    local lastc = string.sub(string,stringLength,stringLength)
    if lastc is a multibyte character then
        local wordTable = {}
        for word in string:gmatch("[\33-\127\192-\255]+[\128-\191]*") do
            wordTable[#wordTable+1] = word
        end
        lastc = wordTable[#wordTable]
    end
    return lastc
end
First of all, note that there are no functions in Lua's string library that know anything about Unicode/multibyte encodings (source: Programming in Lua, 3rd edition). As far as Lua is concerned, strings are simply made up of bytes. If you are using UTF-8 encoded strings, it's up to you to figure out which bytes make up a character. Therefore, string.len will give you the number of bytes, not the number of characters, and string.sub will give you a substring of bytes, not a substring of characters.
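To make the byte-versus-character distinction concrete, here is a small sketch. The sample string is written with decimal escapes so that it does not depend on how the source file is encoded; \195\177 is the two-byte UTF-8 encoding of ñ.

```lua
-- "anñ" in UTF-8: 'a' and 'n' take one byte each, 'ñ' takes two (195, 177).
local s = "an\195\177"

print(string.len(s))          -- 4: bytes, not characters
print(#s)                     -- 4: the length operator counts bytes too

-- Taking the last "character" with string.sub yields only the final byte,
-- which is a continuation byte and not valid UTF-8 on its own.
local lastByte = string.sub(s, -1)
print(string.byte(lastByte))  -- 177
```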
Some UTF-8 basics:
If you need some refreshing on the conceptual basics of Unicode, you should check out this article.
UTF-8 is one possible (and very important) encoding of Unicode, and probably the one you are dealing with. Unlike UTF-32 (always four bytes per character) and UTF-16 (two or four bytes), it uses from 1 to 4 bytes to encode each character. In particular, the ASCII characters 0 to 127 are represented with a single byte, so that ASCII strings can be correctly interpreted as UTF-8 (and vice versa, if you only use those 128 characters). All other characters start with a byte in the range from 194 to 244, which signals that more bytes follow to encode a full character. This range is further subdivided, so that you can tell from this byte whether 1, 2 or 3 more bytes follow. Those additional bytes are called continuation bytes and are guaranteed to be taken only from the range from 128 to 191. Therefore, by looking at a single byte we know where it stands in a character:
[0,127]: it's a single-byte (ASCII) character
[128,191]: it's a continuation byte, part of a longer character and meaningless on its own
[194,244]: it marks the beginning of a longer character (and tells us how long that character is)
This information is enough to count characters, split a UTF-8 string into characters and do all sorts of other UTF-8-sensitive manipulations.
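As a rough illustration of how far the byte ranges alone get you, the sketch below classifies bytes and counts characters. The names byteRole and utf8Length are made up for this example, and the code assumes its input is already valid UTF-8 (it does no validation).

```lua
-- Classify one byte value (0-255) by its role in a UTF-8 sequence.
local function byteRole(b)
    if b <= 127 then
        return "single"        -- ASCII, a complete character
    elseif b <= 191 then
        return "continuation"  -- part of a longer character
    elseif b >= 194 and b <= 244 then
        return "lead"          -- first byte of a 2- to 4-byte character
    else
        return "invalid"       -- 192, 193 and 245-255 never occur in valid UTF-8
    end
end

-- Count characters by counting every byte that is not a continuation byte.
local function utf8Length(s)
    local count = 0
    for i = 1, #s do
        if byteRole(string.byte(s, i)) ~= "continuation" then
            count = count + 1
        end
    end
    return count
end

print(utf8Length("an\195\177"))  -- 3 ("anñ": four bytes, three characters)
```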
Some pattern matching basics:
For the task at hand we need a few of Lua's pattern matching constructs:
[...] is a character class that matches a single character (or rather byte) of those inside the class. E.g. [abc] matches either a, or b, or c. You can define ranges using a hyphen. Therefore [\33-\127], for example, matches any single one of the bytes from 33 to 127. Note that \127 is an escape sequence you can use in any Lua string (not just patterns) to specify a byte by its numerical value instead of the corresponding ASCII character. For instance, "a" is the same as "\97". You can negate a character class by starting it with ^, so that it matches any single byte that is not part of the class.
* repeats the previous token 0 or more times (arbitrarily many times, as often as possible).
$ is an anchor. If it's the last character of the pattern, the pattern will only match at the end of the string.
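A few throwaway calls can help confirm how these constructs behave. The sample strings here are arbitrary and chosen only for illustration.

```lua
-- A decimal escape is just another way to write a byte.
print("a" == "\97")                       -- true

-- Character class with a range: the first digit in the string.
print(string.match("abc123", "[0-9]"))    -- 1

-- Negated class followed by *: as many non-digits as possible.
print(string.match("abc123", "[^0-9]*"))  -- abc

-- $ anchors the match to the end of the string.
print(string.match("abc123", "[0-9]*$"))  -- 123
print(string.match("abc123x", "[0-9]$"))  -- nil (the string does not end in a digit)
```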
Combining all of that...
...your problem reduces to a one-liner:
local function lastChar(s)
    return string.match(s, "[^\128-\191][\128-\191]*$")
end
This first matches a byte that is not a UTF-8 continuation byte (i.e., either a single-byte character or a byte that marks the beginning of a longer character). It then matches an arbitrary number of continuation bytes (this cannot go past the current character, due to the range chosen), followed by the end of the string ($). Therefore, this will give you all the bytes that make up the last character in the string. It produces the desired output for all 4 of your examples.
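Here is a quick check of lastChar against the four sample strings from the question. It assumes the source file is saved as UTF-8, so that the string literals contain the expected bytes. Note that an empty string has no last character, in which case string.match returns nil.

```lua
local function lastChar(s)
    return string.match(s, "[^\128-\191][\128-\191]*$")
end

print(lastChar("hello there"))  -- e
print(lastChar("anñ"))          -- ñ
print(lastChar("כראע"))         -- ע
print(lastChar("ㅎㄹㅇㅇㅅ"))    -- ㅅ
print(lastChar(""))             -- nil
```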
Equivalently, you can use gsub to remove that last character from your string:
function deleteLastCharacter(s)
    -- gsub also returns the number of replacements; the parentheses drop it
    return (string.gsub(s, "[^\128-\191][\128-\191]*$", ""))
end
The match is the same, but instead of returning the matched substring, we replace it with "" (i.e., remove it) and return the modified string.
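One detail worth knowing: string.gsub returns two values, the resulting string and the number of replacements made. If only the string is wanted, wrapping the call in parentheses truncates the result to the first value. The sketch below shows this, again assuming the source file is saved as UTF-8.

```lua
local function deleteLastCharacter(s)
    -- The outer parentheses keep only the first result of gsub.
    return (string.gsub(s, "[^\128-\191][\128-\191]*$", ""))
end

print(deleteLastCharacter("hello there"))  -- hello ther
print(deleteLastCharacter("anñ"))          -- an
print(deleteLastCharacter("כראע"))         -- כרא
print(deleteLastCharacter("ㅎㄹㅇㅇㅅ"))    -- ㅎㄹㅇㅇ

-- An empty string is returned unchanged.
print(deleteLastCharacter("") == "")       -- true

-- Without the parentheses, the replacement count comes along as well
-- (print separates the two values with a tab).
print(string.gsub("abc", "[^\128-\191][\128-\191]*$", ""))  -- ab    1
```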