
regex unicode character in vim

Tags: regex, vim, unicode

I'm being an idiot.

Someone cut and pasted some text from Microsoft Word into my lovely HTML files.

I now have these Unicode characters instead of regular quote symbols (i.e. quotes appear as <92> in the text).

I want to do a regex replace but I'm having trouble selecting them.

:%s/\u92/'/g
:%s/\u5C/'/g
:%s/\x92/'/g
:%s/\x5C/'/g

...all fail. My google-fu has failed me.

aidan asked Jun 10 '10 17:06



2 Answers

According to :help regexp (lightly edited), Vim needs its own syntax to match a Unicode character in a regular expression:

\%u    match specified multibyte character (e.g. \%u20ac)

That is, to search for the Unicode character with hex code 20AC, put this in your search pattern:

\%u20ac 

The full table of character search patterns includes some additional options:

\%d    match specified decimal character (e.g. \%d123)
\%x    match specified hex character (e.g. \%x2a)
\%o    match specified octal character (e.g. \%o040)
\%u    match specified multibyte character (e.g. \%u20ac)
\%U    match specified large multibyte character (e.g. \%U12345678)
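
Applied to the original question, a minimal sketch might look like the following. It assumes the <92> that Vim displays is the single byte 0x92, which is how Windows-1252 encodes the right single quotation mark (U+2019). If the file was converted to UTF-8 instead, the character to match would be U+2019 itself. The attempts in the question fail because, in a Vim pattern, \x is the hex-digit class and \u is the uppercase-letter class, so \x92 and \u92 do not denote a character code.

```vim
" Assumption: <92> is the raw byte 0x92 (a Windows-1252 right single quote).
:%s/\%x92/'/g

" Assumption: the file is UTF-8 and holds real curly single quotes instead
" (U+2018 is the left quote, U+2019 the right one).
:%s/\%u2018\|\%u2019/'/g
```

Placing the cursor on the offending character and typing ga shows its actual code, which tells you which of the two commands applies.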
michaelmichael answered Sep 22 '22 23:09


This may not address the problem as originally stated, but it solves a closely related one, so I think it belongs here.

I don't know in which version of Vim it was implemented, but I was working on 7.4 when I tried it.

In Insert mode, the key sequence to enter a Unicode character is ctrl-v u xxxx, where xxxx is the hexadecimal code point. For instance, to enter the euro sign, type ctrl-v u 20ac.

I tried it in Command-line mode as well and it worked. That is, to replace all instances of "20 euro" in my document with "20 €", I'd do:

:%s/20 euro/20 <ctrl-v u 20ac>/gc 

In the above, <ctrl-v u 20ac> is not literal text; it is the sequence of keys that inserts the character.
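
Where typing the key sequence is awkward, for example in a script or mapping, one possible alternative is to build the character from its code point inside the replacement. This is a sketch rather than part of the answer above, and it assumes 'encoding' is set to utf-8 so that nr2char(0x20ac) yields the euro sign.

```vim
" \= at the start of the replacement makes the rest a Vim expression.
" nr2char() converts a code point to a character (assumes 'encoding' is utf-8).
:%s/20 euro/\='20 ' . nr2char(0x20ac)/gc
```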

Michael Ekoka answered Sep 19 '22 23:09