Replace all accented characters by their LaTeX equivalent

Question

Given a Unicode string, I want to replace each non-ASCII character with the LaTeX code that produces it (for example, é should become \'e, and œ should become \oe). I'm incorporating this into a Python program. It should rely on a translation table, and I have come up with the following code, which is simple and seems to work nicely:

accents = [
    [ u"à", "\\`a"],  # backslashes are doubled so Python keeps them literally
    [ u"é", "\\'e"]
  ]
translation_table = dict([(ord(k), unicode(v)) for k, v in accents])
print u"été à l'eau".translate(translation_table)

However, writing a reasonably complete translation table will take me a long time, and Google didn't help much. Does anyone have such a table ready, or know where to find one?

PS: I'm new to Python, so I welcome comments on the code above, of course.

F'x · Accepted Answer

OK, so here's the table I've built up for now. Please feel free to edit to add to it! (or comment if you don't have enough reputation to edit)

################################################################
# LaTeX accents replacement
# Backslashes are doubled so that Python keeps them literally.
latexAccents = [
  [ u"à", "\\`a" ], # Grave accent
  [ u"è", "\\`e" ],
  [ u"ì", "\\`\\i" ],
  [ u"ò", "\\`o" ],
  [ u"ù", "\\`u" ],
  [ u"ỳ", "\\`y" ],
  [ u"À", "\\`A" ],
  [ u"È", "\\`E" ],
  [ u"Ì", "\\`I" ],  # there is no \I command; a capital I takes the accent directly
  [ u"Ò", "\\`O" ],
  [ u"Ù", "\\`U" ],
  [ u"Ỳ", "\\`Y" ],
  [ u"á", "\\'a" ], # Acute accent
  [ u"é", "\\'e" ],
  [ u"í", "\\'\\i" ],
  [ u"ó", "\\'o" ],
  [ u"ú", "\\'u" ],
  [ u"ý", "\\'y" ],
  [ u"Á", "\\'A" ],
  [ u"É", "\\'E" ],
  [ u"Í", "\\'I" ],
  [ u"Ó", "\\'O" ],
  [ u"Ú", "\\'U" ],
  [ u"Ý", "\\'Y" ],
  [ u"â", "\\^a" ], # Circumflex
  [ u"ê", "\\^e" ],
  [ u"î", "\\^\\i" ],
  [ u"ô", "\\^o" ],
  [ u"û", "\\^u" ],
  [ u"ŷ", "\\^y" ],
  [ u"Â", "\\^A" ],
  [ u"Ê", "\\^E" ],
  [ u"Î", "\\^I" ],
  [ u"Ô", "\\^O" ],
  [ u"Û", "\\^U" ],
  [ u"Ŷ", "\\^Y" ],
  [ u"ä", "\\\"a" ],    # Umlaut or dieresis
  [ u"ë", "\\\"e" ],
  [ u"ï", "\\\"\\i" ],
  [ u"ö", "\\\"o" ],
  [ u"ü", "\\\"u" ],
  [ u"ÿ", "\\\"y" ],
  [ u"Ä", "\\\"A" ],
  [ u"Ë", "\\\"E" ],
  [ u"Ï", "\\\"I" ],
  [ u"Ö", "\\\"O" ],
  [ u"Ü", "\\\"U" ],
  [ u"Ÿ", "\\\"Y" ],
  [ u"ç", "\\c{c}" ],   # Cedilla
  [ u"Ç", "\\c{C}" ],
  [ u"œ", "{\\oe}" ],   # Ligatures
  [ u"Œ", "{\\OE}" ],
  [ u"æ", "{\\ae}" ],
  [ u"Æ", "{\\AE}" ],
  [ u"å", "{\\aa}" ],
  [ u"Å", "{\\AA}" ],
  [ u"–", "--" ],   # Dashes
  [ u"—", "---" ],
  [ u"ø", "{\\o}" ],    # Misc latin-1 letters
  [ u"Ø", "{\\O}" ],
  [ u"ß", "{\\ss}" ],
  [ u"¡", "{!`}" ],
  [ u"¿", "{?`}" ],
  [ u"\\", "{\\textbackslash}" ],    # Characters that should be quoted (a bare \\ is a LaTeX line break)
  [ u"~", "{\\textasciitilde}" ],    # a bare \~ is the tilde accent and expects an argument
  [ u"&", "\\&" ],
  [ u"$", "\\$" ],
  [ u"{", "\\{" ],
  [ u"}", "\\}" ],
  [ u"%", "\\%" ],
  [ u"#", "\\#" ],
  [ u"_", "\\_" ],
  [ u"≥", "$\\ge$" ],   # Math operators
  [ u"≤", "$\\le$" ],
  [ u"≠", "$\\neq$" ],
  [ u"©", "{\\copyright}" ], # Misc (braces keep the command from swallowing the next space)
  [ u"ı", "{\\i}" ],
  [ u"µ", "$\\mu$" ],
  [ u"°", "$^\\circ$" ],  # \deg typesets the operator name "deg", not a degree sign
  [ u"‘", "`" ],    # Quotes
  [ u"’", "'" ],
  [ u"“", "``" ],
  [ u"”", "''" ],
  [ u"‚", "," ],
  [ u"„", ",," ],
]
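
For readers on Python 3, here is a minimal sketch of how a table like the one above can be applied. It assumes Python 3, where str is already Unicode, so the u prefix and the unicode() call are not needed. The short sample_accents list and the to_latex helper are illustrative names only; the full list can be used in exactly the same way.

```python
# Python 3 sketch: build a translation table from (character, LaTeX) pairs.
# `sample_accents` is a small illustrative subset of the full table.
sample_accents = [
    ["à", "\\`a"],
    ["é", "\\'e"],
    ["ö", "\\\"o"],
    ["œ", "{\\oe}"],
    ["%", "\\%"],
]

# str.translate expects a dict keyed by code point (an int), hence ord().
translation_table = {ord(char): latex for char, latex in sample_accents}


def to_latex(text):
    """Replace every character found in the table by its LaTeX code."""
    return text.translate(translation_table)


print(to_latex("été à l'eau"))  # prints: \'et\'e \`a l'eau
```

Because translate works character by character in a single pass, the backslashes and braces introduced by one replacement are not escaped a second time by the entries for the special characters.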

Giuseppe Ottaviano · Answer

If you are not in control of LaTeX compilation options, you can use the same table used by the inputenc package, so that the behavior will be the same as if you had used inputenc.

This document explains how inputenc does the mapping. It is a sequence of declarations which, in the utf8ienc.dtx source, look like this:

...
%<all,t1,ly1>\DeclareUnicodeCharacter{00C2}{\^A}
%<all,t1,ly1>\DeclareUnicodeCharacter{00C3}{\~A}
%<all,t1,ly1>\DeclareUnicodeCharacter{00C4}{\"A}
%<all,t1,ot1,ly1>\DeclareUnicodeCharacter{00C5}{\r A}
%<all,t1,ot1,ly1,lcy>\DeclareUnicodeCharacter{00C6}{\AE}
%<all,t1,ly1>\DeclareUnicodeCharacter{00C7}{\c C}
%<all,t1,ly1>\DeclareUnicodeCharacter{00C8}{\@tabacckludge`E}

You could parse the file, looking for all the DeclareUnicodeCharacter lines, and extract the mapping with a regexp.

EDIT: I've written some code that does the trick:

# -*- coding: utf-8 -*-
import re

translation_table = {}

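# The backslash is doubled in the pattern: a single '\D' would mean "any non-digit"
pattern = re.compile(r'%.*\\DeclareUnicodeCharacter\{(\w+)\}\{(.*)\}')

with open('utf8ienc.dtx') as dtx:
    for line in dtx:
        m = pattern.match(line)
        if m:
            codepoint, latex = m.groups()
            # '@tabacckludge' seems to be an internal helper; dropping it leaves the plain accent command
            latex = latex.replace('@tabacckludge', '')
            translation_table[int(codepoint, 16)] = unicode(latex)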

print u"été à l'eau".translate(translation_table)

# outputs "\'et\'e \`a l'eau"

You should find utf8ienc.dtx in your LaTeX installation, or you can search for it online.
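
The same parsing idea can be tried without a LaTeX installation at hand. The following is only a sketch, assuming Python 3: the three lines in SAMPLE_DTX are written in the style of the file rather than copied from it, and build_table is a hypothetical helper name.

```python
import re

# Sample lines in the style of utf8ienc.dtx (assumed format; the real file
# contains many more entries).
SAMPLE_DTX = r"""
%<all,t1,ly1>\DeclareUnicodeCharacter{00E0}{\@tabacckludge`a}
%<all,t1,ly1>\DeclareUnicodeCharacter{00E7}{\c c}
%<all,t1,ly1>\DeclareUnicodeCharacter{00E9}{\@tabacckludge'e}
"""

# Group 1 is the hexadecimal code point, group 2 the LaTeX replacement.
PATTERN = re.compile(r'%.*\\DeclareUnicodeCharacter\{(\w+)\}\{(.*)\}')


def build_table(lines):
    """Map each declared code point to its LaTeX replacement."""
    table = {}
    for line in lines:
        m = PATTERN.match(line)
        if m:
            codepoint, latex = m.groups()
            # Drop the helper macro name, keeping the accent command itself.
            table[int(codepoint, 16)] = latex.replace('@tabacckludge', '')
    return table


table = build_table(SAMPLE_DTX.splitlines())
print("été à l'eau".translate(table))  # prints: \'et\'e \`a l'eau
```

To run it against the real file, pass the open file object to build_table instead of the sample lines.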
