For this question I created the following Lua code that converts a Unicode code point to a UTF-8 character string. Is there a better way to do this (in Lua 5.1+)? "Better" in this case means "drastically more efficient, or—preferably—far fewer lines of code".
Note: I'm not really asking for a code review of this algorithm; I'm asking for a better algorithm (or built-in library).
do
  local bytebits = {
    {0x7F,{0,128}},
    {0x7FF,{192,32},{128,64}},
    {0xFFFF,{224,16},{128,64},{128,64}},
    {0x1FFFFF,{240,8},{128,64},{128,64},{128,64}}
  }
  function utf8(decimal)
    local charbytes = {}
    for b,lim in ipairs(bytebits) do
      if decimal<=lim[1] then
        for i=b,1,-1 do
          local prefix,max = lim[i+1][1],lim[i+1][2]
          local mod = decimal % max
          charbytes[i] = string.char( prefix + mod )
          decimal = ( decimal - mod ) / max
        end
        break
      end
    end
    return table.concat(charbytes)
  end
end
c=utf8(0x24) print(c.." is "..#c.." bytes.") --> $ is 1 bytes.
c=utf8(0xA2) print(c.." is "..#c.." bytes.") --> ¢ is 2 bytes.
c=utf8(0x20AC) print(c.." is "..#c.." bytes.") --> € is 3 bytes.
c=utf8(0xFFFF) print(c.." is "..#c.." bytes.") --> is 3 bytes.
c=utf8(0x10000) print(c.." is "..#c.." bytes.") --> 𐀀 is 4 bytes.
c=utf8(0x24B62) print(c.." is "..#c.." bytes.") --> 𤭢 is 4 bytes.
I feel like there ought to be a way to get rid of the whole bytebits predefined table and loop just to find the matching entry. Looping from the back I could continually %64 and add 128 to form the continuation bytes until the value was below 128, but I can't figure out how to elegantly generate the 0/110/1110/11110 preamble to add on.
Edit: Here's a slightly better reworking, with a speed optimization. This is not an acceptable answer, though, since the algorithm is still basically the same idea and about the same amount of code.
do
  local bytemarkers = { {0x7FF,192}, {0xFFFF,224}, {0x1FFFFF,240} }
  function utf8(decimal)
    if decimal<128 then return string.char(decimal) end
    local charbytes = {}
    for bytes,vals in ipairs(bytemarkers) do
      if decimal<=vals[1] then
        for b=bytes+1,2,-1 do
          local mod = decimal%64
          decimal = (decimal-mod)/64
          charbytes[b] = string.char(128+mod)
        end
        charbytes[1] = string.char(vals[2]+decimal)
        break
      end
    end
    return table.concat(charbytes)
  end
end
If we're talking about speed, the usage pattern in a real-world scenario is very important. But here we're in a vacuum, so let's proceed anyway.
This algorithm is probably what you're looking for when you say you think you ought to be able to get rid of bytebits:
do
  local string_char = string.char
  function utf8(cp)
    if cp < 128 then
      return string_char(cp)
    end
    local s = ""
    local prefix_max = 32
    while true do
      local suffix = cp % 64
      s = string_char(128 + suffix)..s
      cp = (cp - suffix) / 64
      if cp < prefix_max then
        return string_char((256 - (2 * prefix_max)) + cp)..s
      end
      prefix_max = prefix_max / 2
    end
  end
end
It also includes some other optimizations which aren't particularly interesting, and for me it is about 2x as fast as your optimized code. (As a bonus, it should work all the way up to U+7FFFFFFF as well.)
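As a quick check of that last claim (my own addition, assuming the utf8 function defined just above is in scope; code points above U+10FFFF use the legacy 5- and 6-byte sequences, which modern UTF-8 forbids):
c=utf8(0x7FFFFFFF) print(#c.." bytes") --> 6 bytes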
If we want to micro-optimize even more, the loop can be unrolled to:
do
  local string_char = string.char
  function utf8_unrolled(cp)
    if cp < 128 then
      return string_char(cp)
    end
    local suffix = cp % 64
    local c4 = 128 + suffix
    cp = (cp - suffix) / 64
    if cp < 32 then
      return string_char(192 + cp, c4)
    end
    suffix = cp % 64
    local c3 = 128 + suffix
    cp = (cp - suffix) / 64
    if cp < 16 then
      return string_char(224 + cp, c3, c4)
    end
    suffix = cp % 64
    cp = (cp - suffix) / 64
    return string_char(240 + cp, 128 + suffix, c3, c4)
  end
end
This is about 5x as fast as your optimized code, but wholly inelegant. I think the main gains come from not storing intermediate results on the heap and from making fewer function calls.
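If you want to reproduce the relative timings, a rough harness like the following works (my own sketch; the bench helper, the iteration count, and the code-point range are arbitrary, and absolute numbers will vary with hardware and with plain Lua versus LuaJIT):
do
  local clock = os.clock
  -- Time an encoder over a fixed range of code points and report seconds.
  function bench(name, encode)
    local start = clock()
    for _ = 1, 20 do
      for cp = 0, 0x2FFFF do
        encode(cp)
      end
    end
    print(name, clock() - start)
  end
end
bench("looped  ", utf8)
bench("unrolled", utf8_unrolled)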
However, the fastest (as far as I can find) approach is not to do the calculation at all:
do
  local lookup = {}
  for i=0,0x1FFFFF do
    -- calculate_utf8 stands in for any of the encoders above (e.g. utf8_unrolled).
    lookup[i]=calculate_utf8(i)
  end
  function utf8(cp)
    return lookup[cp]
  end
end
This is about 30x as fast as your optimized code, which may qualify as "drastically more efficient" (although the memory usage is ridiculous). However, it is also not interesting. (A good compromise in some cases would be to use memoization.)
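For instance, a minimal memoization sketch (my own addition, assuming one of the encoders above, here utf8_unrolled, is in scope); each encoding is computed on first use and cached, so memory grows only with the set of code points actually requested:
do
  local encode = utf8_unrolled
  -- Cache that computes and stores an encoding the first time a code point
  -- is looked up; later lookups hit the table directly.
  local cache = setmetatable({}, {
    __index = function(t, cp)
      local s = encode(cp)
      t[cp] = s
      return s
    end
  })
  function utf8_memoized(cp)
    return cache[cp]
  end
end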
Of course, any pure C implementation is likely to be faster than any calculation done in Lua.
Lua 5.3 provides a basic UTF-8 library; its utf8.char function is what you are looking for:
Receives zero or more integers, converts each one to its corresponding UTF-8 byte sequence and returns a string with the concatenation of all these sequences.
c = utf8.char(0x24) print(c.." is "..#c.." bytes.") --> $ is 1 bytes.
c = utf8.char(0xA2) print(c.." is "..#c.." bytes.") --> ¢ is 2 bytes.
c = utf8.char(0x20AC) print(c.." is "..#c.." bytes.") --> € is 3 bytes.
c = utf8.char(0xFFFF) print(c.." is "..#c.." bytes.") --> is 3 bytes.
c = utf8.char(0x10000) print(c.." is "..#c.." bytes.") --> 𐀀 is 4 bytes.
c = utf8.char(0x24B62) print(c.." is "..#c.." bytes.") --> 𤭢 is 4 bytes.
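If you also need to support Lua 5.1/5.2, a common pattern is to use the built-in library when it exists and fall back to a pure-Lua encoder otherwise. This is my own sketch; it assumes an encoder such as utf8_unrolled from the earlier answer is available, and note that the question's own function is also named utf8, which would shadow the 5.3 library table if both are loaded:
-- Prefer the built-in utf8.char (Lua 5.3+); otherwise fall back to a
-- pure-Lua encoder such as utf8_unrolled from the earlier answer.
local codepoint_to_utf8
if type(utf8) == "table" and utf8.char then
  codepoint_to_utf8 = utf8.char
else
  codepoint_to_utf8 = utf8_unrolled
end
c = codepoint_to_utf8(0x20AC) print(c.." is "..#c.." bytes.") --> € is 3 bytes.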