Lua gmatch odd characters (Slovak alphabet)

Tags: string, unicode, lua, coronasdk, lua-patterns

I am trying to extract the characters from a string of a word in Slovak. For example, the word for "TURTLE" is "KORYTNAČKA". However, it skips over the "Č" character when I try to extract it from the string:

local str = "KORYTNAČKA"
for c in str:gmatch("%a") do print(c) end
--result: K,O,R,Y,T,N,A,K,A

I am reading this page and I have also tried just pasting in the string itself as a set, but it comes up with something weird:

local str = "KORYTNAČKA"
for c in str:gmatch("["..str.."]") do print(c) end
--result: K,O,R,Y,T,N,A,Ä,Œ,K,A

Anyone know how to solve this?

asked Apr 09 '14 06:04

Omid Ahourai

2 Answers

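Lua strings are 8-bit clean: they can hold any byte values, but the string library treats every byte as one character. The pattern "%a" therefore matches a single byte that counts as a letter in the current locale. The bytes that make up "Č" do not (at least in the default C locale), so the character is skipped.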

The pattern "["..str.."]" gets closer because the set contains every byte of the string, including the bytes of the multi-byte character. It still matches one byte at a time, though, so "Č" (two bytes in UTF-8) comes back as two separate matches, which is the "Ä,Œ" you see.
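
To see what both patterns are working with, it can help to dump the raw bytes. A minimal sketch, assuming the string is UTF-8 encoded (Č is written with decimal escapes here so the result does not depend on how the source file is saved):

```lua
-- "KORYTNAČKA" with Č spelled out as its two UTF-8 bytes (196, 140)
local str = "KORYTNA\196\140KA"

print(#str)            -- 11 bytes, although the word has 10 letters
print(str:byte(8, 9))  -- 196  140  (0xC4 0x8C, the encoding of Č)

-- A set built from the string matches one byte at a time,
-- so Č comes out as two separate single-byte matches.
local pieces = {}
for c in str:gmatch("[" .. str .. "]") do
    pieces[#pieces + 1] = c
end
print(#pieces)         -- 11
```

How bytes 196 and 140 are displayed on their own depends on the console encoding, which is why they show up as unrelated characters.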

If the string is UTF-8 encoded, you can use the pattern "[\0-\x7F\xC2-\xF4][\x80-\xBF]*" to match a single UTF-8 byte sequence (one lead byte followed by any continuation bytes) in Lua 5.2, like this:

local str = "KORYTNAČKA"
for c in str:gmatch("[\0-\x7F\xC2-\xF4][\x80-\xBF]*") do 
    print(c) 
end

In Lua 5.1 (which is the version Corona SDK uses), patterns cannot contain "\0" and there are no "\x" escapes, so use "%z" and decimal escapes instead:

local str = "KORYTNAČKA"
for c in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do 
    print(c) 
end
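
If you need the characters in a table rather than printed, the same pattern can be wrapped in a small helper. This is only a sketch: `utf8_chars` is a made-up name, not a Lua or Corona function, and it assumes the input is valid UTF-8 (it does no validation).

```lua
-- Hypothetical helper: split a UTF-8 string into a table of characters.
-- Uses the Lua 5.1 form of the pattern shown above.
local function utf8_chars(s)
    local chars = {}
    for c in s:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
        chars[#chars + 1] = c
    end
    return chars
end

local chars = utf8_chars("KORYTNA\196\140KA")  -- \196\140 is Č in UTF-8
print(#chars)                    -- 10
print(table.concat(chars, ","))  -- K,O,R,Y,T,N,A,Č,K,A on a UTF-8 console
```

Once the characters are in a table, `chars[8]` gives you "Č" as a whole two-byte string.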

For details about this pattern, see Equivalent pattern to “[\0-\x7F\xC2-\xF4][\x80-\xBF]*” in Lua 5.1.

answered Nov 15 '22 18:11

Yu Hao

Lua has no built-in support for Unicode strings. The "Ä,Œ" in your output is the two bytes (0xC4 0x8C) of the UTF-8 encoding of "Č", each printed as a separate character.

Yu Hao already provided a sample solution, but for more details here is a good source.

I've tested this solution and found it works properly in Lua 5.1 (reserve link). You could extract individual characters using the utf8sub function; see the sample.
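
To give an idea of what a utf8sub-style function does, here is a rough sketch. It is not the code from the linked solution: `utf8_sub` is a hypothetical name, it only handles positive indices that are in range, and it assumes valid UTF-8 input.

```lua
-- Hypothetical stand-in for a utf8sub-style function (not the linked code):
-- returns characters i..j of a UTF-8 string, counting characters, not bytes.
local function utf8_sub(s, i, j)
    local chars = {}
    for c in s:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
        chars[#chars + 1] = c
    end
    j = j or #chars
    return table.concat(chars, "", i, j)
end

local str = "KORYTNA\196\140KA"  -- \196\140 is Č in UTF-8
print(utf8_sub(str, 8, 8))       -- the whole character Č (two bytes)
print(#str:sub(8, 8))            -- 1: string.sub cuts Č in half
```

Splitting the whole string on every call is wasteful for long strings, but it keeps the sketch short.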

answered Nov 15 '22 18:11

Petr Abdulin
