
Iterating through string fails with StringIndexError

Tags: string, julia

Note: This question was motivated by this discourse thread.

Consider the following example string:

str = "This is some text that initially consists of normal ASCII characters—but oh wait, the em-dash is only part of the extended ASCII character set!"

Trying to iterate through this string using its length:

for i in 1:length(str)
  println(i, str[i])
end

This fails with a StringIndexError partway through the loop, with the following message:

ERROR: StringIndexError("This is some text that initially consists of normal ASCII characters—but oh wait, the em-dash is only part of the extended ASCII character set!", 70)
Stacktrace:
 [1] string_index_err(::String, ::Int64) at ./strings/string.jl:12
 [2] getindex_continued(::String, ::Int64, ::UInt32) at ./strings/string.jl:217
 [3] getindex(::String, ::Int64) at ./strings/string.jl:210
 [4] top-level scope at ./REPL[4]:2

What exactly is the reason for this behavior?

Wolf asked Mar 03 '23 09:03


1 Answer

Strings in Julia are encoded in UTF-8 and fully support Unicode. UTF-8 is a variable-width encoding, however, so the number of bytes used by a single character depends on the character. Integer indexing into a String addresses bytes (code units), not characters.

Standard ASCII characters (code points less than 128) use one byte each, so byte index and character index coincide and the iteration behaves as expected. The em-dash, however, is not an ASCII character: it is the Unicode code point U+2014, which UTF-8 encodes as three bytes. In the example string it occupies byte indices 69 to 71, so str[69] returns the em-dash, while str[70] points into the middle of that character and throws the StringIndexError. More on strings and their behavior can be found in the documentation (specifically the section "Unicode and UTF-8").
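To make the byte layout concrete, here is a minimal sketch using a short string of my own (not the one from the question) that contains an em-dash. It only uses ncodeunits, length, codeunits and isvalid from Base:

```julia
# Hypothetical short string for illustration; the middle character is U+2014.
s = "a—b"

println(ncodeunits(s))    # 5: total number of bytes (code units)
println(length(s))        # 3: number of characters
println(codeunits("—"))   # the three bytes that encode the em-dash

# isvalid(s, i) is true only if byte index i is the start of a character.
for i in 1:ncodeunits(s)
    println(i, " => ", isvalid(s, i))
end
```

Indices 3 and 4 are reported as invalid because they fall inside the em-dash. This is the same situation as index 70 in the question.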

Edit: As Stefan mentioned in the comments, note that length(str) behaves in the expected way and returns the number of characters in the string, not the number of bytes. The last valid index position can be retrieved via lastindex(str).
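As a rough illustration of the difference between length and lastindex, the sketch below uses the string "a ∀ x ∃ y" and steps through the valid indices by hand with nextind. The function name print_by_index is made up for this example:

```julia
s = "a ∀ x ∃ y"

println(length(s))      # 9: number of characters
println(lastindex(s))   # 13: byte index of the last character
println(nextind(s, 3))  # 6: the next valid index after the 3-byte '∀' at index 3

# Hypothetical helper: walk the string by valid byte indices.
function print_by_index(s::AbstractString)
    i = firstindex(s)
    while i <= lastindex(s)
        println(i, " ", s[i])
        i = nextind(s, i)   # jumps over the continuation bytes
    end
end

print_by_index(s)
```

In practice eachindex(s) does the same stepping, so the manual loop is only meant to show what happens underneath.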

This error can be circumvented in multiple ways, depending on the desired behavior:

Option 1: Iterating over the string elements directly
If the index is not relevant, this is the easiest way to go about it:

for c in str
  println(c)
end

Option 2: Using eachindex to extract the correct string indices
If the actual index position in the string is relevant, one can do:

for bi in eachindex(str)
  println(bi, str[bi])
end

Option 3: Using enumerate to get linear index positions and characters
If the "character" index (i.e. the index/number of the current character, not its byte index) into the string and the corresponding character are relevant:

for (ci, c) in enumerate(str)
  println(ci, c)
end
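If random access by character position is needed rather than a full loop, one possible approach (a sketch, not the only way) is either to collect the string into a Vector{Char} or to convert a character position into a byte index with the three-argument nextind:

```julia
s = "a ∀ x ∃ y"

# Option A: materialize the characters; indices are now character positions.
chars = collect(s)
println(chars[3])           # ∀

# Option B: nextind(s, 0, n) gives the byte index of the n-th character,
# without allocating an array.
bi = nextind(s, 0, 3)
println(bi, " ", s[bi])     # 3 ∀
println(nextind(s, 0, 7))   # 9: byte index of the 7th character, '∃'
```

Option A costs an allocation but gives constant-time indexing afterwards. Option B has to scan from the start of the string on every call.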

Edit 2: Added a small example to clarify, using the string str = "a ∀ x ∃ y".

Option 1 returns:

julia> for c in str; print(c, " | "); end
a |   | ∀ |   | x |   | ∃ |   | y | 

Option 2 returns:

julia> for bi in eachindex(str); print(bi, " ", str[bi], " | "); end
1 a | 2   | 3 ∀ | 6   | 7 x | 8   | 9 ∃ | 12   | 13 y |

Note, for example, the jump from 3 to 6: ∀ occupies three bytes (indices 3 to 5), so the next valid index is 6.

Option 3 returns:

julia> for (ci, c) in enumerate(str); print(ci, " ", c, " | "); end
1 a | 2   | 3 ∀ | 4   | 5 x | 6   | 7 ∃ | 8   | 9 y | 
Wolf answered Mar 16 '23 09:03