
Does Vim have an equivalent to \X to match Unicode "grapheme clusters"?

Tags: regex, vim, unicode

Unicode specifies that \X should match an "extended grapheme cluster" - for instance a base character followed by zero or more combining characters. (I believe this is a simplification but may suffice for my needs.)

I'm pretty sure at least Perl supports \X in its regular expressions.

But Vim defines \X to match a non-hexadecimal digit.

Does Vim have any equivalent to \X or any way to match a Unicode extended grapheme cluster?

Vim does have a concept of combining or "composing" characters, but its documentation does not cover whether or how they are supported in regular expressions.

It seems that Vim does not yet support this directly, but I am still interested in a workaround where a search will highlight all characters which include a combining character in at least the most basic range of U+0300 to U+0364.

hippietrail asked Jun 07 '12 12:06


2 Answers

You can search for all characters and ignore composing characters with \Z. Or you can search for a range of Unicode characters. Read :help /[] for more information on both.

The last post here may offer some more help:

http://vim.1045645.n5.nabble.com/using-regexp-to-search-for-Unicode-code-points-and-properties-td1190333.html

But Vim's regex does not have Unicode property character classes like Perl's \p{...}.

embedded.kyle answered Oct 26 '22 19:10


If your Vim installation is compiled with Perl support, you may be able to run:

:perldo s/\X/replacement/g

I installed vim-nox on Debian (which includes Perl support), and matching \X with perldo does indeed work. I'm not sure it will do what you want, though: all normal characters are also matched, and it doesn't seem like perldo will get you highlighting in Vim.

While it's not perfect, if you can get Perl support, you can use Unicode blocks and categories. That means you can use \p{Block: Combining_Diacritical_Marks} or \p{Category: Nonspacing_Mark} to at least detect certain characters, though you still won't get highlighting.
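
To make the idea of "a base character followed by combining marks" concrete outside of Vim, here is a minimal sketch in Python rather than Perl or Vim script. It is not a full implementation of extended grapheme clusters; it only approximates the \p{Category: Nonspacing_Mark} test by checking for general category Mn, and the function name clusters_with_marks is made up for this illustration.

```python
import unicodedata


def clusters_with_marks(text):
    """Yield (start_index, cluster) for each character that is followed by
    one or more nonspacing marks (Unicode general category Mn).

    This is a simplification: real extended grapheme clusters follow more
    rules than "base character plus Mn marks".
    """
    i = 0
    n = len(text)
    while i < n:
        j = i + 1
        # Absorb every nonspacing mark that directly follows text[i].
        while j < n and unicodedata.category(text[j]) == "Mn":
            j += 1
        if j - i > 1:
            yield i, text[i:j]
        i = j


# "e" + U+0301 (combining acute) and "i" + U+0308 (combining diaeresis)
sample = "cafe\u0301 nai\u0308ve abc"
for start, cluster in clusters_with_marks(sample):
    print(start, [f"U+{ord(c):04X}" for c in cluster])
```

Like the perldo approach, this only detects the characters and reports their positions; it does nothing for highlighting inside Vim.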

beerbajay answered Oct 26 '22 17:10