How to convert unicode accented characters to pure ascii without accents?

I'm trying to download some content from a dictionary site like http://dictionary.reference.com/browse/apple?s=t

The problem I'm having is that the original paragraph has all those squiggly lines, and reverse letters, and such, so when I read the local files I end up with those funny escape characters like \x85, \xa7, \x8d, etc.

My question is: is there any way I can convert all those escape characters into their closest plain ASCII characters? E.g., if there is an 'à', how do I convert that into a standard 'a'?

Python calling code:

import os
word = 'apple'
os.system(r'wget.lnk --directory-prefix=G:/projects/words/dictionary/urls/ --output-document=G:\projects\words\dictionary\urls/' + word + '-dict.html http://dictionary.reference.com/browse/' + word)

I'm using wget-1.11.4-1 on a Windows 7 system (don't kill me Linux people, it was a client requirement), and the wget exe is being fired off with a Python 2.6 script file.

342

asked Jan 02 '13 07:01

Wolf

1 Answer

"How do I convert all those escape characters into their respective characters, e.g. if there is a unicode à, how do I convert that into a standard a?"

Assuming you have loaded your text into a unicode variable called my_unicode, normalizing à into a is this simple:

import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')
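
The snippet above assumes my_unicode already holds decoded text rather than raw bytes. As a rough sketch of one way to get there from a downloaded file: the helper name load_and_strip, the temporary demo file, and the UTF-8 encoding are all assumptions for illustration (the real page may declare a different encoding, in which case pass that instead).

```python
import io
import os
import tempfile
import unicodedata


def load_and_strip(path, encoding='utf-8'):
    """Read a text file, decode it, and drop accents (hypothetical helper)."""
    # io.open decodes the bytes on disk into unicode text while reading.
    with io.open(path, 'r', encoding=encoding) as handle:
        text = handle.read()
    # Decompose accented characters, then discard anything outside ASCII.
    stripped = unicodedata.normalize('NFD', text).encode('ascii', 'ignore')
    return stripped.decode('ascii')


# Demo on a throwaway file standing in for the downloaded page.
fd, demo_path = tempfile.mkstemp(suffix='.html')
os.close(fd)
with io.open(demo_path, 'w', encoding='utf-8') as handle:
    handle.write(u'd\xe9j\xe0 vu')  # "déjà vu"

result = load_and_strip(demo_path)
os.remove(demo_path)
print(result)
```

If the file is decoded with the wrong encoding, the accents will already be garbled before normalization runs, so it is worth checking the page's declared charset first.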

Explicit example (Python 2 session):

>>> import unicodedata
>>> myfoo = u'àà'
>>> myfoo
u'\xe0\xe0'
>>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
'aa'

How it works
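unicodedata.normalize('NFD', "insert-unicode-text-here") performs a Canonical Decomposition (NFD) of the unicode text, splitting each accented character into its base letter followed by separate combining marks. Calling .encode('ascii', 'ignore') on the result then drops every character that is not ASCII, which removes the combining marks and leaves only the base letters.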
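
To make the two steps visible, here is a small sketch that inspects the intermediate decomposed form. The variable names are illustrative only. It also shows a limitation: a character with no canonical decomposition (ø is used here) is removed entirely rather than mapped to a base letter.

```python
import unicodedata

text = u'\xe0'  # 'à' stored as one precomposed code point

# Step 1: NFD splits it into 'a' plus U+0300 COMBINING GRAVE ACCENT.
decomposed = unicodedata.normalize('NFD', text)
code_points = [hex(ord(ch)) for ch in decomposed]

# Step 2: encoding to ASCII with 'ignore' silently drops the combining mark.
ascii_only = decomposed.encode('ascii', 'ignore').decode('ascii')

# 'ø' has no canonical decomposition, so nothing ASCII survives.
no_decomposition = (
    unicodedata.normalize('NFD', u'\xf8')
    .encode('ascii', 'ignore')
    .decode('ascii')
)

print(code_points)
print(repr(ascii_only))
print(repr(no_decomposition))
```

For text containing such characters, a separate replacement table would be needed on top of normalization.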

129

answered Oct 11 '22 17:10

Mike Pennington

Tags: python, unicode, wget, unicode-normalization