
Splitting a multibyte string in Lua

I have a multibyte string in Lua.

local s = "あいうえお"

How do I take the string and split it into a table of strings?

For English text I can use the code below, but it does not work with a multibyte string.

local s = "foo bar 123"
local words = {}
for word in s:gmatch("%w+") do
    table.insert( words, word )
end

asked Nov 22 '25 10:11 by user1169307


1 Answer

As others have noted, it's hard to tell what you want to do: where do you want to split for non-ASCII characters, if splitting at spaces doesn't suffice?
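
If splitting at whitespace is in fact enough, a minimal sketch like the following may do. It assumes the string is UTF-8 and that Lua is running in its default "C" locale, and it uses %S+ (runs of non-whitespace bytes) instead of %w+. This works because no byte of a multi-byte UTF-8 sequence is an ASCII space. The helper name split_on_whitespace is made up for illustration, and non-ASCII spaces such as the ideographic space U+3000 are not treated as separators.

```lua
-- Hypothetical helper: split a UTF-8 string at ASCII whitespace.
local function split_on_whitespace(s)
  local words = {}
  -- %S+ matches a run of bytes that are not whitespace, so multi-byte
  -- characters stay intact inside each word.
  for word in s:gmatch("%S+") do
    words[#words + 1] = word
  end
  return words
end

local words = split_on_whitespace("foo 頑張って 123")
for i, word in ipairs(words) do
  print(i, word)
end
```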

If you just want to split non-ASCII text into individual characters (while still keeping runs of ASCII characters together as words), something like the following may suffice:

local s = "oink barf 頑張っています"
-- One or more ASCII or UTF-8 lead bytes, then any trailing bytes
for word in s:gmatch("[\33-\127\192-\255]+[\128-\191]*") do
   print(word)
end

produces:

oink
barf
頑
張
っ
て
い
ま
す

The trick here is that in UTF-8, each multi-byte character consists of a "lead byte" with the top two bits equal to 11 (so \192-\255 in Lua; remember, character escapes in Lua are decimal), followed by one or more "trailing bytes" with the top two bits equal to 10 (\128-\191 in Lua). ASCII characters are single bytes below \128 with no trailing bytes, which is why the pattern allows zero trailing bytes.
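
To get every character as its own table entry, which is what the question asks for, the same byte ranges can be used with exactly one leading byte per match. The following is a minimal sketch, assuming the input is valid UTF-8; utf8_chars is a hypothetical helper name, and spaces and other ASCII characters come back as separate entries too.

```lua
-- Hypothetical helper: split a UTF-8 string into a table of characters.
local function utf8_chars(s)
  local chars = {}
  -- One ASCII byte or lead byte, followed by any trailing bytes.
  -- Byte \0 is left out of the class so the pattern also works on Lua 5.1.
  for ch in s:gmatch("[\1-\127\192-\255][\128-\191]*") do
    chars[#chars + 1] = ch
  end
  return chars
end

local chars = utf8_chars("あいうえお")
print(#chars)  --> 5 (when this file is saved as UTF-8)
for i, ch in ipairs(chars) do
  print(i, ch)
end
```

On Lua 5.3 and later, the standard utf8 library exposes a similar pattern as utf8.charpattern.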

answered Nov 24 '25 01:11 by snogglethorpe


