Extract the first letter of a UTF-8 string with Lua

Tags: unicode, utf-8, lua

Is there any way to extract the first letter of a UTF-8 encoded string with Lua?

Lua's string functions operate on bytes rather than characters, so string.sub("ÆØÅ", 2, 2) returns the second byte of "Æ" (displayed as "?") rather than "Ø".

Is there a relatively simple UTF-8 parsing algorithm I could apply to the string byte by byte, for the sole purpose of getting the first letter of the string, be it a Chinese character or an A?

Or is this way too complex, requiring a huge library, etc.?

asked Nov 05 '12 15:11 by forthrin



2 Answers

You can easily extract the first letter from a UTF-8 encoded string with the following code:

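-- Matches one UTF-8 character: a lead byte followed by any continuation bytes.
-- %z matches the zero byte (needed in Lua 5.1; deprecated since Lua 5.2, where "\0" can be used in patterns).
function firstLetter(str)
  return str:match("[%z\1-\127\194-\244][\128-\191]*")
end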

This works because a UTF-8 encoded code point is either a single byte from 0 to 127, or a lead byte from 194 to 244 followed by one to three continuation bytes from 128 to 191.
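For the byte-by-byte approach the question asks about, the same byte ranges can be checked directly. The following is only a minimal sketch: the helper name firstCharByLeadByte is hypothetical, and it assumes the input is valid UTF-8, since it does not check the continuation bytes.

```lua
-- Hypothetical helper (not part of any library): returns the first UTF-8
-- character of s, chosen by looking only at the lead byte.
-- Returns nil for an empty string, an invalid lead byte, or truncated input.
local function firstCharByLeadByte(s)
  local b = s:byte(1)
  if not b then return nil end                  -- empty string
  local len
  if b < 0x80 then len = 1                      -- 0xxxxxxx: ASCII
  elseif b >= 0xC2 and b <= 0xDF then len = 2   -- 110xxxxx: 2-byte sequence
  elseif b >= 0xE0 and b <= 0xEF then len = 3   -- 1110xxxx: 3-byte sequence
  elseif b >= 0xF0 and b <= 0xF4 then len = 4   -- 11110xxx: 4-byte sequence
  else return nil end                           -- continuation or invalid lead byte
  if #s < len then return nil end               -- sequence cut short
  -- Assumption: the following len - 1 bytes are valid continuation bytes.
  return s:sub(1, len)
end
```

Called on "ÆØÅ" this should return "Æ" (two bytes), and on a string starting with a Chinese character it should return that character's three bytes.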

You can also iterate over the UTF-8 characters of a string by using the same pattern with gmatch:

for code in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
  print(code)
end

Note that both examples return a string value for each letter, and not the Unicode code point numerical value.
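If the numeric code point is needed, the matched string can be decoded by hand. This is a rough sketch rather than a validated decoder: the helper name codepoint is hypothetical, and it assumes its argument is a single well-formed UTF-8 character such as the one returned by the pattern match.

```lua
-- Hypothetical helper: decodes one well-formed UTF-8 character into its
-- Unicode code point. Returns nil for an empty string or unsupported length.
local function codepoint(ch)
  local b1 = ch:byte(1)
  if not b1 then return nil end
  if b1 < 0x80 then return b1 end      -- single-byte (ASCII) character
  local n = #ch
  local cp
  -- Strip the length marker bits from the lead byte.
  if n == 2 then cp = b1 - 0xC0
  elseif n == 3 then cp = b1 - 0xE0
  elseif n == 4 then cp = b1 - 0xF0
  else return nil end
  -- Each continuation byte contributes its low 6 bits.
  for i = 2, n do
    cp = cp * 64 + (ch:byte(i) - 0x80)
  end
  return cp
end
```

For example, passing "Ø" (bytes 195, 152) should give 216, which is U+00D8.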

answered Sep 20 '22 10:09 by prapin


Lua 5.3 provides a UTF-8 library.

You can use utf8.codes to get each code point, and then use utf8.char to get the character:

local str = "ÆØÅ"
for _, c in utf8.codes(str) do
  print(utf8.char(c))
end

This also works:

local str = "ÆØÅ"
for w in str:gmatch(utf8.charpattern) do
  print(w)
end

Here, utf8.charpattern is just the pattern string "[\0-\x7F\xC2-\xF4][\x80-\xBF]*" (in Lua 5.3), which matches exactly one UTF-8 byte sequence, assuming the subject is a valid UTF-8 string.
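To get only the first letter, as the question asks, one option is utf8.offset, which returns the byte position at which a given character starts. The sketch below assumes Lua 5.3 or later and a valid UTF-8 string; the function name firstLetterUtf8 is hypothetical.

```lua
-- Requires the utf8 library (Lua 5.3+).
-- Hypothetical helper: returns the first UTF-8 character of str,
-- or nil if str is empty.
local function firstLetterUtf8(str)
  -- Byte position where the second character starts; for a string with a
  -- single character this is #str + 1, and for an empty string it is nil.
  local nextStart = utf8.offset(str, 2)
  if not nextStart then return nil end
  return str:sub(1, nextStart - 1)
end
```

This avoids looping over the whole string, since only the start of the second character has to be located.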

answered Sep 19 '22 10:09 by Yu Hao