Words like "Annähren", "Überbringen", "Malmö" are not matched by:
for w in string.gmatch(str, "%w+") do
  print(w)
end
Is there a solution? Thanks!
The Lua string library has no built-in support for any character encoding other than ASCII, and it treats every character as a single byte. Lua strings are 8-bit clean, so they can hold text in any encoding, but functions like string.sub expect offsets in bytes even for multi-byte encodings, and functions like string.match will not behave as expected with non-ASCII text. It is worth reading the wiki page on Unicode in Lua, much of which also applies to other non-ASCII character encodings.
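To make the byte-versus-character point concrete, here is a minimal sketch. It assumes UTF-8 and writes 'ö' as explicit decimal byte escapes so the result does not depend on how the source file is saved; the variable names are only illustrative.

```lua
-- "Malmö" in UTF-8, with ö spelled out as the bytes C3 B6 (195, 182).
local s = "Malm\195\182"

local byte_len  = #s            -- counts bytes, not characters
local half_char = s:sub(5, 5)   -- only the first byte of ö
local full_char = s:sub(5, 6)   -- both bytes of ö

print(byte_len)                     -- 6, although there are 5 characters
print(half_char == "\195")          -- true: this is half of a character
print(full_char == "\195\182")      -- true: the complete ö
```

Any offset arithmetic on such strings therefore has to be done in bytes, or delegated to a UTF-8 aware library.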
For your issue in particular, 'ö' is encoded (in UTF-8, for example) as the two bytes C3 B6, which means that it will not be recognized by '%w'. That class tests each byte with the C library's isalnum, so in the default "C" locale it matches only A-Z, a-z and 0-9, and it has no concept of characters spanning multiple bytes. '[\xc3\xb6]+' will match it, but will also match a lot of other things, not all of which are even valid UTF-8. Using '[ö]' has the same issue, as Lua interprets it as the same thing (a set of two separate bytes rather than a single character). If you are not using UTF-8, the specifics are different, but the basic problem remains the same.
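A short sketch of both effects, assuming UTF-8 input. The call to os.setlocale("C") is only there to pin the behaviour of '%w' to the default locale; the variable names are hypothetical.

```lua
-- Force the default "C" locale so %w means ASCII letters and digits only.
os.setlocale("C")

local word = "Malm\195\182"            -- "Malmö" in UTF-8
local ascii_part = word:match("%w+")   -- stops at the first byte of ö
print(ascii_part)                      -- Malm

-- The class [\195\182] is a set of two independent bytes, so it also
-- accepts byte sequences that are not valid UTF-8 at all.
local stray = "\182\182\195"
local accepted = stray:match("^[\195\182]+$") ~= nil
print(accepted)                        -- true
```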
The wiki page links to a number of UTF-8 aware string library implementations for Lua, such as slnunicode. Other encodings do not appear to be widely used by the community, so if you are using an encoding other than UTF-8, your best bet may be to convert to UTF-8 and then use that library or another like it.
You may try the following:
local str = "Annähren, Überbringen, Malmö"
for w in string.gmatch(str, "[%w\128-\244]+") do
  print(w)
end
It's not strictly correct: the class accepts any byte from 128 to 244, so it does not validate UTF-8 and it also treats non-ASCII punctuation as part of a word. It may still work for you. This SO answer and this post on validating UTF-8 may be useful.
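As a rough illustration of where that pattern works and where it falls short, the sketch below wraps it in a helper. The name utf8_words is hypothetical and not part of any library, and the input is assumed to be UTF-8.

```lua
-- Hypothetical helper: collect runs of ASCII alphanumerics plus any
-- byte in the range 128-244 (the lead and continuation bytes of UTF-8).
local function utf8_words(str)
  local words = {}
  for w in string.gmatch(str, "[%w\128-\244]+") do
    words[#words + 1] = w
  end
  return words
end

-- The words from the question come back whole.
local words = utf8_words("Annähren, Überbringen, Malmö")
print(#words)      -- 3

-- Limitation: non-ASCII punctuation is also made of bytes >= 128, so the
-- guillemets (C2 AB and C2 BB) stay glued to the word they surround.
local quoted = utf8_words("\194\171Malm\195\182\194\187")
print(#quoted[1])  -- 10 bytes: both guillemets plus "Malmö"
```

If that limitation matters, a UTF-8 aware library with real character classes is the safer route.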