Is there any way to extract the first letter of a UTF-8 encoded string with Lua? Lua does not properly support Unicode, so string.sub("ÆØÅ", 2, 2) will return "?" rather than "Ø". Is there a relatively simple UTF-8 parsing algorithm I could run over the string byte by byte, for the sole purpose of getting the first letter of the string, be it a Chinese character or an A? Or is this way too complex, requiring a huge library, etc.?


Extract the first letter of a UTF-8 string with Lua

2 Answers

You can easily extract the first letter from a UTF-8 encoded string with the following code:

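-- Returns the first UTF-8 character of str as a string (nil if str is empty).
function firstLetter(str)
  -- %z matches the zero byte; Lua 5.1 patterns cannot contain a literal "\0".
  return str:match("[%z\1-\127\194-\244][\128-\191]*")
end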

This works because a UTF-8 encoded character is either a single byte from 0 to 127, or a lead byte from 194 to 244 followed by one to three continuation bytes from 128 to 191.
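For the byte-by-byte approach the question asks about, the same byte ranges can be checked by hand. The following is only a minimal sketch, assuming the input is valid UTF-8; the name firstLetterByLeadByte is made up for illustration and is not part of any library. It looks at the lead byte alone to decide how many bytes the first character occupies.

```lua
-- Sketch: read the first character by inspecting only its lead byte.
-- Assumes str is valid UTF-8; firstLetterByLeadByte is an illustrative name.
local function firstLetterByLeadByte(str)
  local b = str:byte(1)
  if not b then return nil end                 -- empty string
  local len
  if b < 128 then len = 1                      -- 0xxxxxxx: ASCII
  elseif b >= 194 and b <= 223 then len = 2    -- 110xxxxx: 2-byte sequence
  elseif b >= 224 and b <= 239 then len = 3    -- 1110xxxx: 3-byte sequence
  elseif b >= 240 and b <= 244 then len = 4    -- 11110xxx: 4-byte sequence
  else return nil end                          -- continuation or invalid lead byte
  return str:sub(1, len)
end

-- "\195\134" is the UTF-8 encoding of "Æ", "\228\184\173" that of "中".
print(firstLetterByLeadByte("\195\134\195\152\195\133") == "\195\134")
print(firstLetterByLeadByte("\228\184\173\230\150\135") == "\228\184\173")
```

Unlike the pattern version, this sketch returns nil when the string starts with a stray continuation byte instead of skipping ahead to the next valid lead byte.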

You can even iterate over the characters of a UTF-8 string with the same pattern, using gmatch:

for code in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
  print(code)
end

Note that both examples return a string value for each letter, and not the Unicode code point numerical value.
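If the numeric code point is needed as well, it can be computed from the bytes of the matched string. This is a rough sketch using plain arithmetic so that it also runs on Lua 5.1; codePoint is a hypothetical helper name, and the function assumes it receives exactly one well-formed UTF-8 character.

```lua
-- Sketch: decode one UTF-8 character (a string of 1 to 4 bytes) to its code point.
-- Assumes ch is a single, valid UTF-8 sequence; codePoint is an illustrative name.
local function codePoint(ch)
  local b1, b2, b3, b4 = ch:byte(1, 4)
  if not b1 then return nil end                 -- empty string
  if b1 < 128 then return b1 end                -- 1 byte: 7 payload bits
  if b1 < 224 then                              -- 2 bytes: 5 + 6 payload bits
    return (b1 - 192) * 64 + (b2 - 128)
  end
  if b1 < 240 then                              -- 3 bytes: 4 + 6 + 6 payload bits
    return (b1 - 224) * 4096 + (b2 - 128) * 64 + (b3 - 128)
  end
  -- 4 bytes: 3 + 6 + 6 + 6 payload bits
  return (b1 - 240) * 262144 + (b2 - 128) * 4096 + (b3 - 128) * 64 + (b4 - 128)
end

print(codePoint("\195\134"))       --> 198   (U+00C6, "Æ")
print(codePoint("\228\184\173"))   --> 20013 (U+4E2D, "中")
```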

answered Sep 20 '22 10:09

prapin

Lua 5.3 provides a UTF-8 library.

You can use utf8.codes to get each code point, and then use utf8.char to get the character:

local str = "ÆØÅ"
for _, c in utf8.codes(str) do
  print(utf8.char(c))
end

This also works:

local str = "ÆØÅ"
for w in str:gmatch(utf8.charpattern) do
  print(w)
end

where utf8.charpattern is just the string "[\0-\x7F\xC2-\xF4][\x80-\xBF]*", a pattern that matches exactly one UTF-8 byte sequence.
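To get only the first letter rather than looping over all of them, utf8.offset from the same library can locate where the second character starts. The sketch below assumes Lua 5.3 or later and valid UTF-8 input (utf8.offset raises an error if the string starts with a continuation byte); firstChar is an illustrative name, not a library function.

```lua
-- Sketch: first character of a UTF-8 string using the Lua 5.3+ utf8 library.
-- firstChar is an illustrative name, not part of the utf8 library.
local function firstChar(str)
  if str == "" then return nil end
  -- utf8.offset(str, 2) is the byte position where the 2nd character starts,
  -- or #str + 1 when the string holds exactly one character.
  local nextStart = utf8.offset(str, 2)
  return str:sub(1, nextStart - 1)
end

local first = firstChar("\195\134\195\152\195\133")  -- UTF-8 bytes of "ÆØÅ"
print(first == "\195\134")      --> true
print(utf8.codepoint(first))    --> 198
```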

answered Sep 19 '22 10:09

Yu Hao

Tags:

unicode

utf-8

lua

forthrin
