
Split Unicode entities by graphemes

"d̪".chars.to_a

gives me

["d"," ̪"]

How do I get Ruby to split it by graphemes?

["d̪"]
Asked by Reactormonk, Apr 22 '26 13:04


1 Answer

Edit: As @michau's answer notes, Ruby 2.5 introduced the grapheme_clusters method, as well as each_grapheme_cluster if you just want to iterate/enumerate without necessarily creating an array.
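If the same code has to run on both older and newer Rubies, a small wrapper can pick whichever approach is available. This is only a sketch: `split_graphemes` is a hypothetical helper name, not something from the answer or from Ruby itself. It assumes that falling back to the \X regex (described below) is acceptable on Ruby 2.0 to 2.4.

```ruby
# Hypothetical helper (not part of Ruby or the original answer):
# use String#grapheme_clusters on Ruby 2.5+, otherwise fall back to \X.
def split_graphemes(str)
  if str.respond_to?(:grapheme_clusters)
    str.grapheme_clusters
  else
    str.scan(/\X/)
  end
end

# "d" followed by U+032A COMBINING BRIDGE BELOW, repeated three times.
sample = "d\u032A" * 3

p sample.length                   # counts code points:       6
p split_graphemes(sample).length  # counts grapheme clusters: 3

# On Ruby 2.5+, each_grapheme_cluster iterates without building an array first.
if sample.respond_to?(:each_grapheme_cluster)
  sample.each_grapheme_cluster { |cluster| p cluster.codepoints }
end
```

Checking with respond_to? rather than comparing RUBY_VERSION keeps the helper working on any implementation that provides the method.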


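In Ruby 2.0 or above you can use str.scan(/\X/), where \X matches one extended grapheme cluster (a base character plus any combining marks attached to it):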

> "d̪".scan /\X/
=> ["d̪"]
> "d̪d̪d̪".scan /\X/
=> ["d̪", "d̪", "d̪"]

# Let's get crazy:


> str = 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞'


> str.length
=> 75
> str.scan(/\X/).length
=> 6
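The gap between length and the grapheme count is what makes plain slicing risky: cutting by code points can separate a combining mark from its base character. As a rough sketch, assuming you want to keep the first n user-visible characters, something like the following could work. `truncate_graphemes` is a made-up name for illustration, not an existing Ruby or Rails method.

```ruby
# Hypothetical helper: keep at most `count` grapheme clusters of `str`.
# Uses \X, so it should work on Ruby 2.0 or above.
def truncate_graphemes(str, count)
  str.scan(/\X/).first(count).join
end

# "d" followed by U+032A COMBINING BRIDGE BELOW, repeated three times.
sample = "d\u032A" * 3

# Slicing by code points stops in the middle of the second cluster,
# leaving a bare "d" without its combining mark.
p sample[0, 3].codepoints

# Slicing by grapheme clusters keeps each "d" together with its mark.
p truncate_graphemes(sample, 2).codepoints
```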

If you want to match the grapheme boundaries for any reason, you can use (?=\X) in your regex, for instance:

> "d̪".split /(?=\X)/
=> ["d̪"]

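ActiveSupport (which is included in Rails) also has a way if you can't use \X for some reason. Note that this helper was deprecated in Rails 6.0 in favor of the \X approach above, so it only applies to older Rails versions: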

ActiveSupport::Multibyte::Unicode.unpack_graphemes("d̪").map { |codes| codes.pack("U*") }
Answered by Inkling, Apr 25 '26 10:04


