UTF-8 string array using graphemes vs split

Question

Is there an advantage to using graphemes over split for creating an array from a UTF-8 string?

For example, consider the following:

# Define a UTF-8 string with a bunch of multibyte characters
s = "{(-n↑⍵÷⊃⊖⍵),⍨⍉1↓⍉∘.=⍨⍳n←1-⍨≢⍵}"

# Create an array using split
split(s, "")

# Create an array using graphemes (v0.4+; Julia 1.0 and later require `using Unicode` first)
collect(graphemes(s))

Both approaches produce the expected output. And indeed,

split(s, "") == collect(graphemes(s))

returns true.

The two approaches seem to produce equivalent results consistently. Is one generally preferred over the other, be it for performance, style, or some other reason?

(Note that graphemes returns an iterator rather than an array, hence the collect.)
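To illustrate the note about the iterator, here is a minimal sketch, assuming Julia 1.0 or later (where graphemes lives in the Unicode standard library). The variable names g, n, and first_three are only for illustration. It shows that the iterator can be counted and partially consumed without building the full array first.

```julia
using Unicode  # assumption: Julia 1.0+; graphemes is provided by the Unicode stdlib

s = "{(-n↑⍵÷⊃⊖⍵),⍨⍉1↓⍉∘.=⍨⍳n←1-⍨≢⍵}"

g = graphemes(s)                              # lazy iterator, not an array
n = length(g)                                 # counts graphemes without collect
first_three = collect(Iterators.take(g, 3))   # materialize only the first three

println(first_three)
```

If only a single pass over the characters is needed, iterating over g directly avoids allocating the intermediate array that collect creates.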

Josh Durham · Accepted Answer

It depends on what you need. graphemes() returns what users would perceive as single characters, even when they consist of more than one code point; for example, a letter combined with an accent mark is a single grapheme. split() does not do this: with an empty delimiter it breaks the string at every code point.

Consider a + ◌́ (the letter a followed by a combining accent). Here split() returns the two code points as separate elements, whereas graphemes() returns a single element.
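The difference can be checked directly. The following is a small sketch, assuming Julia 1.0 or later and assuming the accent is U+0301 (combining acute accent); the names by_split and by_grapheme are only for illustration.

```julia
using Unicode  # assumption: Julia 1.0+; graphemes is provided by the Unicode stdlib

# 'a' followed by U+0301 (combining acute accent, chosen here as an example mark)
s = "a\u0301"

by_split = split(s, "")              # one element per code point
by_grapheme = collect(graphemes(s))  # one element per user-perceived character

println(length(by_split))            # 2
println(length(by_grapheme))         # 1
println(by_split == by_grapheme)     # false
```

So the two approaches agree on the string in the question only because it contains no combining sequences. For text that may contain them, graphemes() is the safer choice when each array element should correspond to one visible character.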

Tags: arrays, string, utf-8, julia

Alex A.