
More elegant, simpler way to convert code point to UTF-8

Tags: utf-8, lua

For this question I created the following Lua code that converts a Unicode code point to a UTF-8 character string. Is there a better way to do this (in Lua 5.1+)? "Better" in this case means "drastically more efficient, or—preferably—far fewer lines of code".

Note: I'm not really asking for a code review of this algorithm; I'm asking for a better algorithm (or built-in library).

do
  local bytebits = {
    {0x7F,{0,128}},
    {0x7FF,{192,32},{128,64}},
    {0xFFFF,{224,16},{128,64},{128,64}},
    {0x1FFFFF,{240,8},{128,64},{128,64},{128,64}}
  }
  function utf8(decimal)
    local charbytes = {}
    for b,lim in ipairs(bytebits) do
      if decimal<=lim[1] then
        for i=b,1,-1 do
          local prefix,max = lim[i+1][1],lim[i+1][2]
          local mod = decimal % max
          charbytes[i] = string.char( prefix + mod )
          decimal = ( decimal - mod ) / max
        end
        break
      end
    end
    return table.concat(charbytes)
  end
end

c=utf8(0x24)     print(c.." is "..#c.." bytes.") --> $ is 1 bytes.
c=utf8(0xA2)     print(c.." is "..#c.." bytes.") --> ¢ is 2 bytes.
c=utf8(0x20AC)   print(c.." is "..#c.." bytes.") --> € is 3 bytes.  
c=utf8(0xFFFF)   print(c.." is "..#c.." bytes.") -->  is 3 bytes.
c=utf8(0x10000)  print(c.." is "..#c.." bytes.") --> 𐀀 is 4 bytes.
c=utf8(0x24B62)  print(c.." is "..#c.." bytes.") --> 𤭢 is 4 bytes.   

I feel like there ought to be a way to get rid of the whole bytebits predefined table and loop just to find the matching entry. Looping from the back I could continually %64 and add 128 to form the continuation bytes until the value was below 128, but I can't figure out how to elegantly generate the 0/110/1110/11110 preamble to add on.


Edit: Here's a slightly better reworking, with a speed optimization. This is not an acceptable answer, though, since the algorithm is still basically the same idea and about the same amount of code.

do
  local bytemarkers = { {0x7FF,192}, {0xFFFF,224}, {0x1FFFFF,240} }
  function utf8(decimal)
    if decimal<128 then return string.char(decimal) end
    local charbytes = {}
    for bytes,vals in ipairs(bytemarkers) do
      if decimal<=vals[1] then
        for b=bytes+1,2,-1 do
          local mod = decimal%64
          decimal = (decimal-mod)/64
          charbytes[b] = string.char(128+mod)
        end
        charbytes[1] = string.char(vals[2]+decimal)
        break
      end
    end
    return table.concat(charbytes)
  end
end
Phrogz asked Sep 27 '14 03:09


2 Answers

If we're talking about speed, the usage pattern in a real-world scenario is very important. But here, we're in a vacuum, so let's proceed anyway.

This algorithm is probably what you're looking for when you say you feel there ought to be a way to get rid of bytebits:

do
  local string_char = string.char
  function utf8(cp)
    if cp < 128 then
      return string_char(cp)
    end
    local s = ""
    local prefix_max = 32
    while true do
      local suffix = cp % 64
      s = string_char(128 + suffix)..s
      cp = (cp - suffix) / 64
      if cp < prefix_max then
        return string_char((256 - (2 * prefix_max)) + cp)..s
      end
      prefix_max = prefix_max / 2
    end
  end
end

It also includes some other optimizations which aren't particularly interesting, and for me it is about 2x as fast as your optimized code. (As a bonus, it should work all the way up to U+7FFFFFFF as well, since the lead-byte marker keeps growing as prefix_max is halved.)
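The lead-byte expression from the loop above can be checked in isolation. This is only a sketch: the helper name lead_marker is made up for illustration and is not part of the answer's code. It repeats the same halving of prefix_max and should give the markers 192, 224 and 240 for 2-, 3- and 4-byte sequences.

```lua
-- Hypothetical helper (not from the answer): returns the lead-byte marker
-- for a sequence of `nbytes` bytes (2..6), using the same halving of
-- prefix_max that the loop performs once per extra continuation byte.
local function lead_marker(nbytes)
  local prefix_max = 32          -- a 2-byte lead byte carries 5 payload bits
  for _ = 3, nbytes do
    prefix_max = prefix_max / 2  -- each extra byte costs one payload bit
  end
  return 256 - 2 * prefix_max    -- sets the top bits: 110xxxxx, 1110xxxx, ...
end

for n = 2, 4 do
  print(string.format("%d bytes: marker %d", n, lead_marker(n)))
end
```

In other words, the 110/1110/11110 preamble the question asks about falls out of a single subtraction once the size of the remaining payload is known.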

If we want to micro-optimize even more, the loop can be unrolled to:

do
  local string_char = string.char
  function utf8_unrolled(cp)
    if cp < 128 then
      return string_char(cp)
    end
    local suffix = cp % 64
    local c4 = 128 + suffix
    cp = (cp - suffix) / 64
    if cp < 32 then
      return string_char(192 + cp, c4)
    end
    suffix = cp % 64
    local c3 = 128 + suffix
    cp = (cp - suffix) / 64
    if cp < 16 then
      return string_char(224 + cp, c3, c4)
    end
    suffix = cp % 64
    cp = (cp - suffix) / 64
    return string_char(240 + cp, 128 + suffix, c3, c4)
  end
end

This is about 5x as fast as your optimized code, but wholly inelegant. I think the main gains are not having to store intermediate results on the heap and having fewer function calls.
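The speed ratios quoted here will vary with the Lua version and the machine. A rough way to measure them yourself is sketched below; time_encoder and ascii_only are hypothetical names introduced for this example, and os.clock reports CPU time, so treat the numbers as approximate.

```lua
-- Rough timing helper (hypothetical name): calls `fn` on every code point
-- in [first, last], `rounds` times, and returns the elapsed CPU seconds.
local function time_encoder(fn, first, last, rounds)
  local start = os.clock()
  for _ = 1, rounds do
    for cp = first, last do
      fn(cp)
    end
  end
  return os.clock() - start
end

-- Stand-in encoder so this block runs on its own; substitute utf8 or
-- utf8_unrolled from above to compare the real implementations.
local function ascii_only(cp)
  return string.char(cp % 128)
end

print(string.format("%.3f s", time_encoder(ascii_only, 0, 0xFFFF, 10)))
```

Timing each candidate over the same range and number of rounds gives comparable figures; the range you pick matters, because 1-byte and 4-byte code points take different paths.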

However, the fastest (as far as I can find) approach is not to do the calculation at all:

do
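  -- calculate_utf8 is a placeholder for any of the encoder functions above;
  -- it is not defined in this snippet.
  local lookup = {}
  for i=0,0x1FFFFF do
    lookup[i]=calculate_utf8(i)
  end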
  function utf8(cp)
    return lookup[cp]
  end
end

This is about 30x as fast as your optimized code, which may qualify as "drastically more efficient" (although the memory usage is ridiculous, since the table holds one string for every code point up to 0x1FFFFF). However, it is also not interesting. (A good compromise in some cases would be to use memoization.)
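The memoization compromise could look something like the sketch below. It is an illustration under assumptions, not a measured implementation: encode is a compact copy of the loop-based encoder from earlier so that the block runs on its own, and cache and utf8_memo are names invented here.

```lua
local string_char = string.char

-- Compact copy of the loop-based encoder above, repeated for self-containment.
local function encode(cp)
  if cp < 128 then return string_char(cp) end
  local s, prefix_max = "", 32
  while true do
    local suffix = cp % 64
    s = string_char(128 + suffix) .. s
    cp = (cp - suffix) / 64
    if cp < prefix_max then
      return string_char((256 - 2 * prefix_max) + cp) .. s
    end
    prefix_max = prefix_max / 2
  end
end

-- Cache table: a missing key triggers __index, which computes the encoding
-- once and stores it, so later lookups are plain table reads.
local cache = setmetatable({}, {
  __index = function(t, cp)
    local s = encode(cp)
    t[cp] = s
    return s
  end
})

-- Hypothetical wrapper name; only code points actually requested use memory.
local function utf8_memo(cp)
  return cache[cp]
end

local c = utf8_memo(0x20AC)
print(c .. " is " .. #c .. " bytes.")
```

Compared with the full lookup table, this only pays for the code points a program really uses, at the cost of one metamethod call the first time each one is seen.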

Of course, any pure C implementation is likely to be faster than any calculation done in Lua.

tehtmi answered Nov 15 '22 07:11


Lua 5.3 provides a basic UTF-8 library; its function utf8.char is what you are looking for:

Receives zero or more integers, converts each one to its corresponding UTF-8 byte sequence and returns a string with the concatenation of all these sequences.

c = utf8.char(0x24)     print(c.." is "..#c.." bytes.") --> $ is 1 bytes.
c = utf8.char(0xA2)     print(c.." is "..#c.." bytes.") --> ¢ is 2 bytes.
c = utf8.char(0x20AC)   print(c.." is "..#c.." bytes.") --> € is 3 bytes.  
c = utf8.char(0xFFFF)   print(c.." is "..#c.." bytes.") -->  is 3 bytes.
c = utf8.char(0x10000)  print(c.." is "..#c.." bytes.") --> 𐀀 is 4 bytes.
c = utf8.char(0x24B62)  print(c.." is "..#c.." bytes.") --> 𤭢 is 4 bytes.
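For code that has to run on both Lua 5.1 and 5.3, one option is to use utf8.char when it exists and fall back to a pure-Lua encoder otherwise. The sketch below assumes that approach; codepoint_to_utf8 and fallback are names made up for this example, and the fallback only covers code points up to 0x1FFFFF. Note that the question's code defines a global function named utf8, which on Lua 5.3 would replace the standard utf8 table, so a different name is used here.

```lua
-- Use the built-in when the standard utf8 library is present (Lua 5.3+).
local native = type(utf8) == "table" and utf8.char

-- Pure-Lua fallback for 1- to 4-byte sequences (cp <= 0x1FFFFF).
local function fallback(cp)
  if cp < 128 then return string.char(cp) end
  local b4 = 128 + cp % 64
  cp = (cp - cp % 64) / 64
  if cp < 32 then return string.char(192 + cp, b4) end
  local b3 = 128 + cp % 64
  cp = (cp - cp % 64) / 64
  if cp < 16 then return string.char(224 + cp, b3, b4) end
  local b2 = 128 + cp % 64
  cp = (cp - cp % 64) / 64
  return string.char(240 + cp, b2, b3, b4)
end

-- Hypothetical wrapper: same result on either Lua version.
local function codepoint_to_utf8(cp)
  if native then return native(cp) end
  return fallback(cp)
end

local c = codepoint_to_utf8(0x20AC)
print(c .. " is " .. #c .. " bytes.")
```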
Yu Hao answered Nov 15 '22 09:11