
Why is indexing of UTF8 strings discouraged in Julia?

The introductory guide to Julia, Learn Julia in Y Minutes, discourages users from indexing UTF8 strings:

# Some strings can be indexed like an array of characters
"This is a string"[1] # => 'T' # Julia indexes from 1
# However, this is will not work well for UTF8 strings,
# so iterating over strings is recommended (map, for loops, etc).

Why is indexing such strings discouraged? What specifically about the structure of this string type makes indexing error-prone? Is this a Julia-specific pitfall, or does it extend to all languages with UTF8 string support?

David Shaked asked Feb 02 '16 16:02



2 Answers

Just to expand upon Scott Jones' comment: Julia also offers fixed-width strings, similar to std::wstring from C++, which allow for convenient indexing. They now live in https://github.com/JuliaStrings/LegacyStrings.jl, so the package has to be installed first with Pkg.add("LegacyStrings").

UTF32String would be the best choice for most use cases. To construct a UTF32String from a normal string: s2 = utf32(s).
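If adding a package is not an option, a rough stand-in for fixed-width indexing is to collect the string into a Vector{Char}. This is a different technique from LegacyStrings, not a demonstration of it, and the sketch below assumes a Julia 1.x session; the variable names are made up for illustration.

```julia
# Sketch only: uses Base's collect, not the LegacyStrings package.
s = "böse"

# collect walks the string character by character and stores each Char
# in an ordinary array, so integer indices now count characters.
chars = collect(s)

fourth = chars[4]        # the fourth character, 'e'
nchars = length(chars)   # 4 characters

# join turns the array back into a normal String when needed.
roundtrip = join(chars)
```

The trade-off is memory: every Char occupies four bytes in the array, whatever its size in UTF-8.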

xji answered Oct 12 '22 12:10


Because in UTF8 a character is not always encoded in a single byte.

Take, for example, the German word böse (evil). The bytes of this string in UTF8 encoding are:

0x62 0xC3 0xB6 0x73 0x65
b    ö         s    e

As you can see, the umlaut ö requires 2 bytes.

Now if you index this UTF8 encoded string directly, "böse"[4] will give you s and not e, because the index refers to the fourth byte rather than the fourth character.
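The byte layout above can be inspected from Julia itself. The following is a minimal sketch assuming Julia 1.x (the manual linked at the end of this answer covers 0.4, where some of these functions were named differently); it only uses functions from Base.

```julia
s = "böse"

# Characters versus bytes (code units): 4 characters, 5 bytes.
nchars = length(s)
nbytes = ncodeunits(s)

# The raw UTF-8 bytes, matching the table above.
bytes = collect(codeunits(s))

# Indices at which a character actually starts: 1, 2, 4, 5.
valid = collect(eachindex(s))

# Index 3 falls inside the two-byte encoding of 'ö', so it is not valid.
three_ok = isvalid(s, 3)

# nextind steps from one character start to the next.
i = firstindex(s)     # 1, 'b'
i = nextind(s, i)     # 2, 'ö'
i = nextind(s, i)     # 4, 's'
third_char = s[i]
```

In Julia 1.x, indexing at a position that is not a character start, such as s[3] here, raises a StringIndexError instead of returning half a character.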

However, you can iterate over the string in Julia, which yields one character at a time:

julia> for c in "böse"
           println(c)
       end
b
ö
s
e
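If the goal is "the n-th character" rather than "the n-th byte", iteration can do the counting. Here is a small sketch assuming Julia 1.x; nth_char is a hypothetical helper written for this answer, not a Base function.

```julia
s = "böse"

# enumerate numbers the characters 1, 2, 3, ... regardless of byte width.
numbered = collect(enumerate(s))

# Hypothetical helper: return the n-th character by walking the string.
function nth_char(str::AbstractString, n::Integer)
    for (k, c) in enumerate(str)
        k == n && return c
    end
    throw(BoundsError(str, n))
end

fourth = nth_char(s, 4)   # 'e', unlike s[4], which is 's'
```

This walks the string from the start on every call, so it is linear in n. For repeated random access, converting once to a fixed-width representation is the better fit.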

And since you asked: no, the problems with direct byte indexing into UTF8 strings are not specific to Julia. Any language that indexes a variable-width encoding by byte offset has the same pitfall.

Recommendation for further reading:
http://docs.julialang.org/en/release-0.4/manual/strings/#unicode-and-utf-8

gollum answered Oct 12 '22 13:10