Is there an advantage to using graphemes over split for creating an array from a UTF-8 string?
For example, consider the following:
# Define a UTF-8 string with a bunch of multibyte characters
s = "{(-n↑⍵÷⊃⊖⍵),⍨⍉1↓⍉∘.=⍨⍳n←1-⍨≢⍵}"
# Create an array using split
split(s, "")
# Create an array using graphemes (v0.4+)
collect(graphemes(s))
Both approaches produce the expected output. And indeed, split(s, "") == collect(graphemes(s)) returns true.
The two approaches seem to consistently produce equivalent results. Is one approach generally preferred over another, be it for performance, style, or otherwise?
(Note that graphemes returns an iterator rather than an array, hence the collect.)
It depends on what you are looking for. graphemes() returns what users would perceive as single characters, even though each may contain more than one code point; for example, a letter combined with an accent mark is a single grapheme. This is not the case with split(s, ""), which separates the string at every code point.
Consider a + ◌́ (the letter a followed by a combining acute accent). In this example, split() will return the two code points as separate elements, whereas graphemes() will return a single element.
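The difference can be checked directly with a short sketch. It assumes Julia 0.7 or later, where graphemes lives in the Unicode standard library (on the v0.4 mentioned in the question it was available without the import). The variable names by_codepoint and by_grapheme are only illustrative.

```julia
using Unicode  # provides graphemes on Julia 0.7+ (assumption: a 1.x-era Julia)

# "a" followed by U+0301 COMBINING ACUTE ACCENT:
# two code points that render as one perceived character
s = "a\u0301"

by_codepoint = split(s, "")           # one element per code point
by_grapheme  = collect(graphemes(s))  # one element per grapheme cluster

println(length(by_codepoint))         # number of code points
println(length(by_grapheme))          # number of perceived characters
println(by_codepoint == by_grapheme)  # the two arrays no longer agree
```

For the string in the question the two arrays match because it contains no combining sequences; they diverge as soon as one perceived character is made of several code points, as here. If the goal is "characters as a user sees them", graphemes is likely the safer choice.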