 

Sublime text replace multiple accented characters with unaccented ones at once

I need to replace all characters with an accent in a text file, that is:

á é í ó ú ñ

for their non-accent equivalents:

a e i o u n

Can this be achieved via some regex command for the entire file at once?


Update (Feb 1st, 2017)

I took the great answer by Keith Hall and turned it into a Sublime package. You can find it here: RemoveNonAsciiChars.

Gabriel asked Aug 12 '16 02:08


1 Answer

You can use a regex like:

(?=\p{L})[^a-zA-Z]

to find the characters with diacritics.

  • (?=\p{L}) is a positive lookahead that ensures the next character is a Unicode letter.
  • [^a-zA-Z] is a negated character class that excludes the plain ASCII letters, which carry no diacritics.

This workaround is necessary because Sublime Text (or, more specifically, the Boost regex engine it uses for Find and Replace) doesn't support \p{M}, the Unicode property for combining marks. See http://www.regular-expressions.info/unicode.html for more information on what the \p meta character does.
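To see what that pattern selects outside of Sublime Text, here is a minimal Python sketch. Python's built-in re module does not support \p{L} either, so this uses str.isalpha() as a stand-in for "is a Unicode letter". The function name find_non_ascii_letters is made up for illustration.

```python
def find_non_ascii_letters(text):
    # Approximates the regex (?=\p{L})[^a-zA-Z]:
    # keep characters that are Unicode letters but not plain ASCII letters.
    # Every ASCII letter is in a-zA-Z, so "letter and code point above 127"
    # selects the same characters.
    return [ch for ch in text if ch.isalpha() and ord(ch) > 127]


if __name__ == "__main__":
    print(find_non_ascii_letters("A\u00f1o: caf\u00e9 123"))
```

Like the regex, this also picks up non-ASCII letters that have no diacritic at all (ß, for example), so it finds candidates rather than strictly "accented" characters.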


For replacing, unfortunately you will need to specify the characters to replace manually. To make matters worse, ST doesn't seem to support POSIX character equivalence classes (such as [[=a=]]), nor does it support conditionals in the replacement, which would allow you to do the find and replace in one pass, using capture groups.

Therefore, you would need to use multiple find expressions like:

[ÀÁÂÃÄÅ]

replace with

A

and

[àáâãäå]

replace with

a

etc.

which is a lot of manual work.
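For comparison, the same pass-by-pass idea can be sketched in plain Python, where each (pattern, replacement) pair corresponds to one manual Find and Replace round. The table below only contains the two classes shown above; REPLACEMENTS and replace_in_passes are hypothetical names, and a real table would need one entry per base letter.

```python
import re

# One entry per manual find/replace pass: (character class, plain letter).
# Only the two classes from the answer are listed; extend as needed.
REPLACEMENTS = [
    ("[ÀÁÂÃÄÅ]", "A"),
    ("[àáâãäå]", "a"),
]


def replace_in_passes(text):
    # Apply each substitution in turn, just like running the
    # Find and Replace dialog once per character class.
    for pattern, plain in REPLACEMENTS:
        text = re.sub(pattern, plain, text)
    return text


if __name__ == "__main__":
    print(replace_in_passes("\u00c1rbol \u00e0 la carte"))
```

Characters that are not listed in any class (é, ñ, and so on) pass through unchanged, which is why the table grows quickly.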


A much quicker approach, with far less manual work, is to use the Python API instead of regex:

  1. Tools menu -> Developer -> New Plugin
  2. Paste in the following:

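    import sublime
    import sublime_plugin
    import unicodedata

    class RemoveNonAsciiCharsCommand(sublime_plugin.TextCommand):
        def run(self, edit):
            # Select the whole buffer.
            entire_view = sublime.Region(0, self.view.size())
            text = self.view.substr(entire_view)
            # NFKD splits accented letters into a base letter plus combining marks;
            # encoding to ASCII with 'ignore' then drops everything that is not ASCII.
            ascii_only = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
            self.view.replace(edit, entire_view, ascii_only)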
    
  3. Save it in the folder ST recommends (which will be your Packages/User folder), as something like remove_non_ascii_chars.py (file extension is important, base name isn't)

  4. View menu -> Show Console
  5. Type/paste in view.run_command('remove_non_ascii_chars') and press Enter
  6. The diacritics will have been removed (the characters with an accent will have been converted to their non-accented equivalents).

Note: the above also removes every non-ASCII character that has no ASCII decomposition (ß and ø, for example, are dropped entirely), not just the diacritics.
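If dropping those other characters is a problem, one possible variation is to remove only the combining marks after decomposition, in the spirit of the "nonspacing characters" approach from the further reading. The sketch below is plain Python, not a Sublime plugin. to_ascii mirrors the expression used in the plugin, while strip_diacritics is a hypothetical alternative and not part of the original answer.

```python
import unicodedata


def to_ascii(text):
    # Same expression as in the plugin: decompose, then drop all non-ASCII.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def strip_diacritics(text):
    # Decompose accented letters into base letter + combining marks (NFD),
    # drop only the nonspacing marks (category "Mn"), then recompose (NFC).
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


if __name__ == "__main__":
    sample = "stra\u00dfe \u00e1\u00f1"
    print(to_ascii(sample))
    print(strip_diacritics(sample))
```

With this variation, á and ñ still become a and n, but a character such as ß is left in place instead of being deleted. To use it in the plugin, the ascii_only line could call a function like strip_diacritics instead.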

Further reading:

  • http://fabzter.com/blog/remove-nonspacing-characters-text-python
  • What is the best way to remove accents in a Python unicode string?
Keith Hall answered Sep 24 '22 05:09