Is there any way to extract the first letter of a UTF-8 encoded string with Lua?
Lua does not properly support Unicode, so string.sub("ÆØÅ", 2, 2)
will return "?"
rather than "Ø"
.
Is there a relatively simple UTF-8 parsing algorithm I could use on the string byte per byte, for the sole purpose of getting the first letter of the string, be it a Chinese character or an A?
Or is this way too complex, requiring a huge library, etc.?
You cannot delete the first character of a string. Returns the substring of s that starts at i and continues until j; i and j can be negative. If j is absent, then it is assumed to be equal to -1 (which is the same as the string length)....
lua] supports all 5.3 string functions for UTF-8. Tested on Lua 5.1 and Lua 5.3 and LuaJIT. [ustring] provides a pure-Lua implementation of a UTF-8 version of each of the functions in the string library except string.
Lua supports Unicode in the way that specifying, storing and querying arbitrary byte values in strings is supported, so you can store any kind of Unicode-encoding encoded string in a Lua string.
In Lua, the string. len() function is used to get the size of a string.
You can easily extract the first letter from a UTF-8 encoded string with the following code:
function firstLetter(str)
return str:match("[%z\1-\127\194-\244][\128-\191]*")
end
Because a UTF-8 code point either begins with a byte from 0 to 127, or with a byte from 194 to 244 followed by one or several bytes from 128 to 191.
You can even iterate over UTF-8 code points in a similar manner:
for code in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
print(code)
end
Note that both examples return a string value for each letter, and not the Unicode code point numerical value.
Lua 5.3 provide a UTF-8 library.
You can use utf8.codes
to get each code point, and then use utf8.char
to get the character:
local str = "ÆØÅ"
for _, c in utf8.codes(str) do
print(utf8.char(c))
end
This also works:
local str = "ÆØÅ"
for w in str:gmatch(utf8.charpattern ) do
print(w)
end
where utf8.charpattern
is just the string "[\0-\x7F\xC2-\xF4][\x80-\xBF]*"
for the pattern to match one UTF-8 byte sequence.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With