I am trying to extract the individual characters of a Slovak word stored in a string. For example, the word for "TURTLE" is "KORYTNAČKA". However, the "Č" character is skipped when I try to extract it from the string:
local str = "KORYTNAČKA"
for c in str:gmatch("%a") do print(c) end
--result: K,O,R,Y,T,N,A,K,A
I am reading this page and I have also tried just pasting in the string itself as a set, but it comes up with something weird:
local str = "KORYTNAČKA"
for c in str:gmatch("["..str.."]") do print(c) end
--result: K,O,R,Y,T,N,A,Ä,Œ,K,A
Anyone know how to solve this?
Lua pattern matching: the `gmatch` function takes an input string and a pattern. The pattern describes what to return. `gmatch` returns an iterator function, and each call to that iterator yields the next match of the pattern in the string.
Strings have the usual meaning: a sequence of characters. Lua is eight-bit clean and so strings may contain characters with any numeric value, including embedded zeros. That means that you can store any binary data into a string. Strings in Lua are immutable values.
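Lua is 8-bit clean: its string functions treat a string as a sequence of bytes, not characters. The pattern "%a" matches a single byte that the current C locale classifies as a letter. In UTF-8, "Č" is encoded as two bytes (0xC4 0x8C), and neither is a letter in the default locale, so the character is skipped and the result is not what you expected.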
The pattern "["..str.."]" only appears to work: a Unicode character may occupy more than one byte, and this pattern puts each of those bytes into the set individually. The two bytes of "Č" therefore match as two separate one-byte results rather than as a single character.
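To see why, it can help to look at the raw bytes. The sketch below assumes the string is UTF-8 encoded and writes Č with decimal escapes (`\196\140`) so that it does not depend on how the source file is saved. It also assumes the default "C" locale, which is what Lua starts with unless `os.setlocale` is called.

```lua
-- "Č" (U+010C) is encoded in UTF-8 as the two bytes 0xC4 0x8C (196, 140).
local ch = "\196\140"

print(#ch)              -- length in bytes, not characters: 2
print(ch:byte(1, -1))   -- the individual byte values: 196 140

-- In the default "C" locale neither byte counts as a letter,
-- so "%a" finds nothing to match here.
print(ch:find("%a"))    -- nil

-- A set built from the string contains both bytes as separate members,
-- so gmatch returns them one at a time instead of as one character.
local pieces = {}
for b in ch:gmatch("[" .. ch .. "]") do
  pieces[#pieces + 1] = b:byte()
end
print(table.concat(pieces, ","))  -- 196,140
```

When those two bytes are printed separately on a console that uses a single-byte code page, they show up as two unrelated characters, which matches the output in the question.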
If the string is UTF-8 encoded, you can use the pattern "[\0-\x7F\xC2-\xF4][\x80-\xBF]*" to match a single UTF-8 byte sequence in Lua 5.2, like this:
local str = "KORYTNAČKA"
for c in str:gmatch("[\0-\x7F\xC2-\xF4][\x80-\xBF]*") do
  print(c)
end
In Lua 5.1 (the version Corona SDK uses), use this:
local str = "KORYTNAČKA"
for c in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
  print(c)
end
For details about this pattern, see Equivalent pattern to “[\0-\x7F\xC2-\xF4][\x80-\xBF]*” in Lua 5.1.
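If you need the characters as a table rather than printing them in a loop, the Lua 5.1 pattern above can be wrapped in a small helper. This is only a sketch: `utf8_chars` is a hypothetical name, and the pattern assumes the input is valid UTF-8 (it does not validate it).

```lua
-- Hypothetical helper: split a UTF-8 string into a table of characters.
-- Uses the Lua 5.1-compatible pattern; the input is assumed to be valid UTF-8.
local UTF8_CHAR = "[%z\1-\127\194-\244][\128-\191]*"

local function utf8_chars(s)
  local chars = {}
  for c in s:gmatch(UTF8_CHAR) do
    chars[#chars + 1] = c
  end
  return chars
end

-- "KORYTNAČKA" with Č written as its UTF-8 bytes, so the file encoding does not matter.
local word = "KORYTNA\196\140KA"
local chars = utf8_chars(word)

print(#word)                     -- 11 (bytes)
print(#chars)                    -- 10 (characters)
print(table.concat(chars, ","))  -- K,O,R,Y,T,N,A,Č,K,A on a UTF-8 terminal
```

Note that `#word` still reports bytes; only the table length reflects the number of characters.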
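Lua 5.1 and 5.2 have no built-in treatment for Unicode strings (a basic utf8 library only arrived in Lua 5.3). The Ä,Œ in your output are the two bytes of the UTF-8 encoding of the character Č (0xC4 0x8C), displayed as two separate single-byte characters.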
Yu Hao has already provided a sample solution, but here is a good source for more details.
I've tested this solution and found that it works properly in Lua 5.1 (backup link). You can extract individual characters using the utf8sub function; see the sample.
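The linked utf8sub implementation is not reproduced here, but the general idea can be sketched as follows. `utf8sub_sketch` is a hypothetical stand-in, not the linked library's function: it assumes valid UTF-8, positive 1-based character indices, and an end index that defaults to the start index.

```lua
-- Hypothetical stand-in for a utf8sub-style function (not the linked implementation).
-- Assumes valid UTF-8 input and positive, 1-based character indices.
local function utf8sub_sketch(s, i, j)
  j = j or i  -- assumption: with no end index, return a single character
  local chars = {}
  for c in s:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
    chars[#chars + 1] = c
  end
  -- Clamp the requested range to the characters that actually exist.
  if i < 1 then i = 1 end
  if j > #chars then j = #chars end
  if i > j then return "" end
  return table.concat(chars, "", i, j)
end

local word = "KORYTNA\196\140KA"     -- "KORYTNAČKA" as UTF-8 bytes
print(utf8sub_sketch(word, 8))       -- Č
print(utf8sub_sketch(word, 7, 9))    -- AČK
```

This version rebuilds the character table on every call, which is fine for short words but would be wasteful for long strings or repeated lookups.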