
UTF-8 string array using graphemes vs split

Is there an advantage to using graphemes over split for creating an array from a UTF-8 string?

For example, consider the following:

# Define a UTF-8 string with a bunch of multibyte characters
s = "{(-n↑⍵÷⊃⊖⍵),⍨⍉1↓⍉∘.=⍨⍳n←1-⍨≢⍵}"

# Create an array using split
split(s, "")

# Create an array using graphemes (v0.4+; on Julia 1.0+ run `using Unicode` first)
collect(graphemes(s))

Both approaches produce the expected output. And indeed,

split(s, "") == collect(graphemes(s))

returns true.

The two approaches seem to consistently produce equivalent results. Is one approach generally preferred over the other, be it for performance, style, or otherwise?

(Note that graphemes returns an iterator rather than an array, hence the collect.)

Alex A. asked Mar 15 '23 13:03



1 Answer

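It depends on what you are looking for. graphemes() returns what users would perceive as single characters (grapheme clusters), even though each one may contain more than one code point; for example, a letter combined with an accent mark is a single grapheme. split(s, "") instead breaks the string at every code point, so the two only agree when every grapheme in the string is a single code point, as in the string from the question.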

Consider the letter a followed by the combining acute accent ◌́ (U+0301). In this example, split() returns the two code points as separate elements, whereas graphemes() returns a single element.
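A minimal sketch of that difference, assuming Julia 1.0 or later, where graphemes is provided by the Unicode standard library (the variable names are only illustrative):

```julia
using Unicode  # graphemes lives in the Unicode standard library on Julia 1.0+

# "a" followed by U+0301 (combining acute accent):
# two code points that render as one user-perceived character
s = "a\u0301"

by_codepoint = split(s, "")           # one element per code point
by_grapheme  = collect(graphemes(s))  # one element per grapheme cluster

println(length(by_codepoint))         # 2
println(length(by_grapheme))          # 1
println(by_codepoint == by_grapheme)  # false
```

So if the array is meant to hold what a reader would call characters, graphemes is the safer choice; if individual code points are what you want, split(s, "") or collect(s) does that.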

Josh Durham answered Mar 20 '23 17:03
