Note: This question was motivated by this discourse thread.
Consider the following example string:
str = "This is some text that initially consists of normal ASCII characters—but oh wait, the em-dash is only part of the extended ASCII character set!"
Trying to iterate through this string using its length:
for i in 1:length(str)
    println(i, str[i])
end
This fails with a StringIndexError, returning the following message halfway through the loop:
ERROR: StringIndexError("This is some text that initially consists of normal ASCII characters—but oh wait, the em-dash is only part of the extended ASCII character set!", 70)
Stacktrace:
[1] string_index_err(::String, ::Int64) at ./strings/string.jl:12
[2] getindex_continued(::String, ::Int64, ::UInt32) at ./strings/string.jl:217
[3] getindex(::String, ::Int64) at ./strings/string.jl:210
[4] top-level scope at ./REPL[4]:2
What exactly is the reason for this behavior?
Strings in Julia are encoded in UTF-8, a variable-width encoding of Unicode. The number of bytes used by a single character therefore depends on the character, and string indices refer to bytes, not characters.
Standard ASCII characters (code points less than 128) use one byte each, so indexing them with a uniform step of 1 works as expected. The em-dash (U+2014), however, is not an ASCII character and takes three bytes in UTF-8. In the example it starts at index 69 and occupies bytes 69 to 71, so str[70] points into the middle of a character and throws the StringIndexError. More on strings and their behavior can be found in the documentation (specifically the section "Unicode and UTF-8").
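To make the variable width concrete, here is a minimal sketch. The three-character string s is a made-up example (not the string from the question), with the em-dash written as the escape \u2014:

```julia
# Hypothetical short string: 'a', an em-dash (U+2014), 'b'.
s = "a\u2014b"

# Number of UTF-8 bytes (code units) used by each character.
widths = [ncodeunits(c) for c in s]

# Raw bytes of the string; the em-dash occupies the middle three.
bytes = collect(codeunits(s))

println(widths)
println(bytes)
```

Here widths comes out as [1, 3, 1]: the two ASCII letters take one byte each, the em-dash takes three.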
Edit: As Stefan mentioned in the comments, note that length(str) behaves in the expected way and returns the actual number of characters in the string. The last valid index, which is a byte position and can therefore be larger than length(str), can be retrieved via lastindex(str).
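The difference between the character count and the valid index positions can be checked directly. This is a small sketch on a made-up three-character string, using isvalid to test whether a position is the start of a character:

```julia
# Hypothetical short string: 'a', an em-dash (U+2014), 'b'.
s = "a\u2014b"

n_chars = length(s)       # number of characters
last_i  = lastindex(s)    # byte index of the last character
n_bytes = ncodeunits(s)   # total number of bytes

# Which integer positions are the start of a character?
valid = [isvalid(s, i) for i in 1:n_bytes]

println((n_chars, last_i, n_bytes))
println(valid)
```

For this string there are 3 characters but the last index is 5, and positions 3 and 4 are not valid indices because they fall inside the em-dash. A loop over 1:length(s) would visit index 3 and fail there.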
This error can be circumvented in multiple ways, depending on the desired behavior:
Option 1: Iterating over the string elements directly
If the index is not relevant, this is the easiest way to go about it:
for c in str
    println(c)
end
Option 2: Using eachindex to extract the correct string indices
If the actual index position in the string is relevant, one can do:
for bi in eachindex(str)
    println(bi, str[bi])
end
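If you need to step through the indices by hand (for example to skip ahead conditionally), the same positions can be reached with nextind. The helper name char_starts below is made up for illustration; it is only a sketch of what eachindex does for you:

```julia
# Hypothetical helper: collect the byte index at which each character starts.
function char_starts(s::AbstractString)
    starts = Int[]
    i = firstindex(s)
    while i <= lastindex(s)
        push!(starts, i)
        i = nextind(s, i)   # jump to the start of the next character
    end
    return starts
end

println(char_starts("a \u2200 x"))   # "\u2200" is the character ∀
```

For the string "a ∀ x" this yields the same indices as collect(eachindex("a ∀ x")), namely 1, 2, 3, 6 and 7.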
Option 3: Using enumerate to get linear index positions and characters
If the "character" index (i.e. the index/number of the current character, not its byte index) into the string and the corresponding character are relevant:
for (ci, c) in enumerate(str)
    println(ci, c)
end
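If both the character index and the byte index are needed at the same time, Options 2 and 3 can be combined. The following is one possible sketch, not part of the original answer, that zips eachindex with the string and wraps the result in enumerate:

```julia
s = "a \u2200 x"   # hypothetical short string; "\u2200" is the character ∀

# Each row is (character index, byte index, character).
rows = [(ci, bi, c) for (ci, (bi, c)) in enumerate(zip(eachindex(s), s))]

for row in rows
    println(row)
end
```

The fourth row is (4, 6, ' '): the second space is the fourth character but sits at byte index 6, because ∀ occupies bytes 3 to 5.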
Edit 2: Added a small example to clarify.
Using the string str = "a ∀ x ∃ y" as an example:
Option 1 returns:
julia> for c in str; print(c, " | "); end
a |   | ∀ |   | x |   | ∃ |   | y | 
Option 2 returns:
julia> for bi in eachindex(str); print(bi, " ", str[bi], " | "); end
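1 a | 2   | 3 ∀ | 6   | 7 x | 8   | 9 ∃ | 12   | 13 y | 
Note the jumps from 3 to 6 and from 9 to 12: ∀ and ∃ each occupy three bytes, so the indices in between are not valid character positions.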
Option 3 returns:
julia> for (ci, c) in enumerate(str); print(ci, " ", c, " | "); end
1 a | 2   | 3 ∀ | 4   | 5 x | 6   | 7 ∃ | 8   | 9 y | 