 

How to write a Lua pattern for words with umlauts

Words like "Annähren", "Überbringen", "Malmö" are not caught by

for w in string.gmatch(str, "%w+") do
     print(w) 
end

Any solution? thanks!

sunmils asked Oct 04 '22 00:10


2 Answers

The Lua string library does not intrinsically support any character encoding other than ASCII, and assumes every character is 1 byte. Lua strings are 8-bit clean, so they can hold text in any encoding, but functions like string.sub expect offsets in bytes even in multi-byte character encodings, and functions like string.match will not behave as expected with non-ASCII encodings. It is worth reading the wiki page on Unicode in Lua, much of which also applies to other non-ASCII character encodings.
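To illustrate the point about byte offsets, here is a minimal sketch. It assumes the text is UTF-8 and writes the two bytes of 'ö' as decimal escapes so the result does not depend on how the source file is saved; the variable names are only for illustration.

```lua
-- "Malmö" encoded as UTF-8; \195\182 are the two bytes of ö (C3 B6).
local s = "Malm\195\182"

-- The length operator counts bytes, not characters.
local byte_len = #s

-- string.sub works on byte offsets, so cutting at 5 splits ö in half.
local cut = s:sub(1, 5)

-- The last byte of the cut string is the dangling lead byte 0xC3.
local last_byte = cut:byte(-1)

print(byte_len, #cut, last_byte)
```

The cut string is no longer valid UTF-8, which is why byte-based offsets need care with multi-byte text.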

For your issue in particular, 'ö' is (in, for example, UTF-8) encoded as the two bytes C3 B6, which means that it will not be recognized by '%w'. That class matches single bytes that C's isalnum accepts, which in the default "C" locale means only the ASCII letters and digits, and it has no concept of characters spanning multiple bytes. '[\xc3\xb6]+' (written '[\195\182]+' in Lua 5.1, which lacks \x escapes) will match it, but will also match any other run of those two bytes, not all of which are even valid UTF-8. Using '[ö]' has the same issue, as Lua will interpret it as the same thing (a set of two bytes rather than a single character). If you are not using UTF-8, the specifics are different, but the basic problem remains the same.
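Both effects can be seen in a short sketch. It assumes UTF-8 text and the default "C" locale (which the standalone interpreter uses unless os.setlocale is called); the variable names are only for illustration.

```lua
-- "Malmö" encoded as UTF-8; \195\182 are the two bytes of ö.
local s = "Malm\195\182"

-- %w accepts only single bytes that isalnum accepts, so the match
-- stops right before the first byte of ö.
local ascii_part = s:match("%w+")

-- A set holding the two bytes of ö accepts them in any order and any
-- number, so it also matches this sequence, which is not valid UTF-8.
local swapped = ("\182\195\195"):match("[\195\182]+")

print(ascii_part, #swapped)
```

The second match covers all three bytes of the malformed input, which shows that the set works on bytes rather than on the character 'ö'.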

The wiki page links a number of UTF-8 aware string library implementations for Lua, such as slnunicode. Other encodings do not appear to be widely used by the community, so if you are using an encoding other than UTF-8, your best bet may be to convert to UTF-8 and then use that library or another like it.

ToxicFrog answered Oct 10 '22 02:10


You may try the following:

local str = "Annähren, Überbringen, Malmö"
for w in string.gmatch(str, "[%w\128-\244]+") do
  print(w) 
end

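It's not strictly correct: the set accepts any byte from 128 to 244, so it also matches non-ASCII characters that are not letters and byte sequences that are not valid UTF-8, but it may work for you. This SO answer and this post on validating UTF-8 may be useful.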
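One way to tighten the byte-range pattern is to walk the string one UTF-8 character at a time and accept only known letters. The sketch below is an illustration under assumptions, not a full solution: utf8_words and extra_letters are hypothetical names, the letter list covers only the German umlauts and ß, the file is assumed to be saved as UTF-8, and the input is not validated.

```lua
-- One UTF-8 character: a non-continuation byte followed by any
-- continuation bytes (128-191). This does not validate the input.
local CHAR = "[^\128-\191][\128-\191]*"

-- Assumed set of non-ASCII letters to accept; extend as needed.
local extra_letters = {}
for ch in ("äöüÄÖÜß"):gmatch(CHAR) do
  extra_letters[ch] = true
end

-- Hypothetical helper: returns a list of the words found in s.
local function utf8_words(s)
  local words, current = {}, {}
  for ch in s:gmatch(CHAR) do
    if ch:match("^%w$") or extra_letters[ch] then
      current[#current + 1] = ch        -- still inside a word
    elseif #current > 0 then
      words[#words + 1] = table.concat(current)
      current = {}                      -- separator ends the word
    end
  end
  if #current > 0 then
    words[#words + 1] = table.concat(current)
  end
  return words
end

local str = "Annähren, Überbringen, Malmö"
local words = utf8_words(str)
for _, w in ipairs(words) do
  print(w)
end
```

Unlike the byte-range set, this treats non-ASCII punctuation such as an en dash as a separator, at the cost of having to list the accepted letters explicitly.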

Paul Kulchenko answered Oct 10 '22 01:10