 

Python: Replace typographical quotes, dashes, etc. with their ascii counterparts


On my website people can post news, and quite a few editors write their text in MS Word or similar tools and then copy and paste it into my site's editor (a simple textarea, no WYSIWYG etc.).

Those texts usually contain "nice" quotes instead of the plain ASCII ones ("). They also sometimes contain those longer dashes like – instead of -.

Now I want to replace all those characters with their ASCII counterparts. However, I do not want to remove umlauts and other non-ASCII characters. I'd also highly prefer a proper solution that does not involve creating a mapping dict for all those characters.

All my strings are unicode objects.

ThiefMaster asked Apr 24 '12 08:04




2 Answers

What about this? It creates a translation table first, but honestly I don't think you can do this without one.

    # Map each typographic character (by code point) to its ASCII replacement
    transl_table = dict([(ord(x), ord(y)) for x, y in zip(u"‘’´“”–-", u"'''\"\"--")])

    with open("a.txt", "w", encoding="utf-8") as f_out:
        a_str = u" ´funny single quotes´ long–-and–-short dashes ‘nice single quotes’ “nice double quotes”   "
        print(" a_str = " + a_str, file=f_out)

        fixed_str = a_str.translate(transl_table)
        print(" fixed_str = " + fixed_str, file=f_out)

I couldn't print these characters to the console on Windows, so the code writes to a text file instead.
The output in a.txt looks as follows:

    a_str = ´funny single quotes´ long–-and–-short dashes ‘nice single quotes’ “nice double quotes”
    fixed_str = 'funny single quotes' long--and--short dashes 'nice single quotes' "nice double quotes"

By the way, the code above works in Python 3. If you need it for Python 2, it might need some fixes because the two versions of the language handle Unicode strings differently.
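The same idea can be wrapped in a reusable function. This is only a minimal sketch for Python 3: the helper name `asciify_punctuation`, the `PUNCT_TABLE` constant and the particular set of characters are my own assumptions, not a complete list. It uses `str.maketrans`, which also allows one character to be replaced by several (the ellipsis here), and it leaves umlauts alone because they are simply not in the table.

```python
# Hypothetical table: extend or trim it to match what your editors actually paste.
PUNCT_TABLE = str.maketrans({
    "\u2018": "'",    # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",    # RIGHT SINGLE QUOTATION MARK
    "\u00b4": "'",    # ACUTE ACCENT
    "\u201c": '"',    # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',    # RIGHT DOUBLE QUOTATION MARK
    "\u2013": "-",    # EN DASH
    "\u2014": "-",    # EM DASH
    "\u2026": "...",  # HORIZONTAL ELLIPSIS (one character becomes three)
})


def asciify_punctuation(text):
    """Replace only the characters listed in PUNCT_TABLE; keep everything else."""
    return text.translate(PUNCT_TABLE)


# "Schöne Grüße" in curly quotes, an en dash, a curly apostrophe and an ellipsis
sample = "\u201cSch\u00f6ne Gr\u00fc\u00dfe\u201d \u2013 it\u2019s fine\u2026"
cleaned = asciify_punctuation(sample)
print(cleaned.encode("unicode_escape").decode("ascii"))
```

Printing the escaped form sidesteps the console encoding problem mentioned above; the umlauts and the ß show up as escapes, the punctuation as plain ASCII.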

Mateo answered Oct 07 '22 18:10


There is no such "proper" solution, because for any given Unicode character there is no "ASCII counterpart" defined.

For example, take the seemingly easy characters that you might want to map to ASCII single and double quotes and hyphens. First, let's generate all the Unicode characters with their official names. Second, let's find all the quotation marks, hyphens and dashes according to the names:

    #!/usr/bin/env python3

    import unicodedata

    def unicode_character_name(char):
        try:
            return unicodedata.name(char)
        except ValueError:
            return None

    # Generate all Unicode characters with their names
    all_unicode_characters = []
    for n in range(0, 0x110000):    # Unicode planes 0-16 (U+0000 to U+10FFFF)
        char = chr(n)               # Python 3
        #char = unichr(n)           # Python 2
        name = unicode_character_name(char)
        if name:
            all_unicode_characters.append((char, name))

    # Find all Unicode quotation marks
    print(' '.join([char for char, name in all_unicode_characters if 'QUOTATION MARK' in name]))
    # " « » ‘ ’ ‚ ‛ “ ” „ ‟ ‹ › ❛ ❜ ❝ ❞ ❟ ❠ ❮ ❯ ⹂ 〝 〞 〟 ＂ 🙶 🙷 🙸

    # Find all Unicode hyphens
    print(' '.join([char for char, name in all_unicode_characters if 'HYPHEN' in name]))
    # - ­ ֊ ᐀ ᠆ ‐ ‑ ‧ ⁃ ⸗ ⸚ ⹀ ゠ ﹣ － 󠀭

    # Find all Unicode dashes
    print(' '.join([char for char, name in all_unicode_characters if 'DASH' in name and 'DASHED' not in name]))
    # ‒ – — ⁓ ⊝ ⑈ ┄ ┅ ┆ ┇ ┈ ┉ ┊ ┋ ╌ ╍ ╎ ╏ ⤌ ⤍ ⤎ ⤏ ⤐ ⥪ ⥫ ⥬ ⥭ ⩜ ⩝ ⫘ ⫦ ⬷ ⸺ ⸻ ⹃ 〜 〰 ︱ ︲ ﹘ 💨

As you can see, as easy as this example is, there are many problems. There are many quotation marks in Unicode that don't look anything like the quotation marks in US-ASCII and there are many hyphens in Unicode that don't look anything like the hyphen-minus sign in US-ASCII.
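Unicode's own decomposition data does not help here either. The sketch below checks the obvious candidate for a "proper" solution, compatibility normalization (NFKD) followed by dropping whatever is not ASCII. The function name `nfkd_ascii` is made up for this illustration. The curly quotes and the en dash have no decomposition mapping at all, so they are simply deleted, while the umlaut, which does decompose, loses its dots. That is the opposite of what the question asks for.

```python
import unicodedata


def nfkd_ascii(text):
    """Naive approach: compatibility-decompose, then drop all non-ASCII code points."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Curly quotes and the en dash carry no decomposition mapping in the Unicode database
no_mapping = [unicodedata.decomposition(ch) for ch in "\u201c\u201d\u2019\u2013"]
print(no_mapping)                 # ['', '', '', '']

# "Müller" in curly quotes, followed by an en dash
sample = "\u201cM\u00fcller\u201d \u2013 ok"
print(repr(nfkd_ascii(sample)))   # 'Muller  ok'
```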

And there are many questions. For example:

  • should the "SWUNG DASH" (⁓) symbol be replaced with an ASCII hyphen (-) or a tilde (~)?
  • should the "CANADIAN SYLLABICS HYPHEN" (᐀) be replaced with an ASCII hyphen (-) or an equals sign (=)?
  • should the "SINGLE LEFT-POINTING ANGLE QUOTATION MARK" (‹) be replaced with an ASCII quotation mark ("), an apostrophe (') or a less-than sign (<)?

To establish a "correct" ASCII counterpart, somebody needs to answer these questions based on the use context. That's why all the solutions to your problem are based on a mapping dictionary in one way or another. And all these solutions will provide different results.
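If typing the mapping by hand is the objection, the dictionary can be generated, as long as the policy decisions are written down somewhere. The following is only a sketch under my own assumptions: dashes are taken to be everything in the Unicode general category Pd, quotes are the Pi/Pf characters whose name contains "QUOTATION MARK", and the debatable characters are excluded. The names `EXCLUDE` and `build_table` are hypothetical, and a different policy would give a different table, which is exactly the point made above.

```python
import sys
import unicodedata

# Policy assumption: leave these debatable characters untouched.
EXCLUDE = {
    "\u1400",  # CANADIAN SYLLABICS HYPHEN
    "\u2039",  # SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    "\u203a",  # SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
}


def build_table(exclude=EXCLUDE):
    """Generate a translate() table for non-ASCII dashes and quotation marks."""
    table = {}
    for cp in range(0x80, sys.maxunicode + 1):
        ch = chr(cp)
        if ch in exclude:
            continue
        category = unicodedata.category(ch)
        if category == "Pd":                      # dash punctuation
            table[cp] = "-"
        elif category in ("Pi", "Pf"):            # initial/final quote punctuation
            name = unicodedata.name(ch, "")
            if "QUOTATION MARK" in name:
                table[cp] = '"' if "DOUBLE" in name else "'"
    return table


table = build_table()
# "Müller" in curly quotes, an en dash, and a curly apostrophe
sample = "\u201cM\u00fcller\u201d \u2013 it\u2019s"
converted = sample.translate(table)
print(converted.encode("unicode_escape").decode("ascii"))
```

Note that this deliberately skips characters such as „ (U+201E), which Unicode classifies as opening punctuation rather than as an initial quote, so it is still a set of judgment calls and still a mapping dictionary underneath.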

Andriy Makukha answered Oct 07 '22 18:10