UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)

Tags: python, utf-8, django

I am attempting to work with a very large dataset that has some non-standard characters in it. I need to use Unicode, per the job specs, but I am baffled (and quite possibly doing it all wrong).

I open the CSV using:

    ncesReader = csv.reader(open('geocoded_output.csv', 'rb'), delimiter='\t', quotechar='"')

Then, I attempt to encode it with:

name=school_name.encode('utf-8'), street=row[9].encode('utf-8'), city=row[10].encode('utf-8'), state=row[11].encode('utf-8'), zip5=row[12], zip4=row[13],county=row[25].encode('utf-8'), lat=row[22], lng=row[23])

I'm encoding everything except the lat and lng because those need to be sent out to an API. When I run the program to parse the dataset into something I can use, I get the following traceback:

Traceback (most recent call last):
  File "push_into_db.py", line 80, in <module>
    main()
  File "push_into_db.py", line 74, in main
    district_map = buildDistrictSchoolMap()
  File "push_into_db.py", line 32, in buildDistrictSchoolMap
    county=row[25].encode('utf-8'), lat=row[22], lng=row[23])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)

I should mention that I'm using Python 2.7.2, and this is part of an app built on Django 1.4. I've read several posts on this topic, but none of them seem to apply directly. Any help will be greatly appreciated.

You might also want to know that some of the non-standard characters causing the issue are Ñ and possibly É.

asked May 02 '12 00:05

jelkimantis

1 Answer

Unicode is not equal to UTF-8. The latter is just an encoding for the former.
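To make the distinction concrete, here is a small sketch showing one Unicode character turned into two different byte sequences by two different encodings. The variable names are only illustrative; the character is the Ñ mentioned in the question.

```python
# One Unicode character, several possible byte representations.
n_tilde = u'\xd1'  # U+00D1, LATIN CAPITAL LETTER N WITH TILDE

as_utf8 = n_tilde.encode('utf-8')      # two bytes: 0xC3 0x91
as_latin1 = n_tilde.encode('latin-1')  # one byte: 0xD1

# Decoding with the matching codec gets the same character back.
assert as_utf8.decode('utf-8') == n_tilde
assert as_latin1.decode('latin-1') == n_tilde
```

The Unicode string is the abstract text; the bytes only make sense once you know which encoding produced them.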

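You are doing it the wrong way around. You are reading UTF-8-encoded data, so you have to decode the UTF-8-encoded byte string into a unicode string. In Python 2, calling .encode on a byte string makes Python first decode it implicitly as ASCII, and that implicit step is what raises the UnicodeDecodeError.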

So just replace .encode with .decode, and it should work (if your .csv is UTF-8-encoded).
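A minimal sketch of the two directions, assuming the file really is UTF-8. The sample value is made up, and the explicit ASCII decode at the end only imitates what Python 2 does behind the scenes, so that the snippet behaves the same on Python 2 and 3.

```python
# Hypothetical cell as the csv module hands it over in Python 2: raw bytes.
raw_cell = b'PE\xc3\x91A'  # UTF-8 bytes for a made-up name containing U+00D1

# bytes -> unicode: decode with the encoding the file was written in.
text = raw_cell.decode('utf-8')

# unicode -> bytes: encode only when bytes are needed again (e.g. for output).
round_trip = text.encode('utf-8')

# Imitation of the implicit step behind raw_cell.encode('utf-8') in Python 2:
# an ASCII decode, which fails on the first non-ASCII byte.
try:
    raw_cell.decode('ascii')
    implicit_decode_failed = False
except UnicodeDecodeError:
    implicit_decode_failed = True
```

In the question's code this would mean, for example, row[9].decode('utf-8') instead of row[9].encode('utf-8').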

Nothing to be ashamed of, though. I bet 3 in 5 programmers had trouble at first understanding this, if not more ;)

Update: If your input data is not UTF-8-encoded, then you have to .decode() with the appropriate encoding, of course. If no encoding is given, Python 2 assumes ASCII, which fails on any non-ASCII character.
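One hint worth noting: the byte in the traceback, 0xd1, is how Latin-1 (ISO-8859-1) and Windows-1252 encode Ñ, whereas UTF-8 encodes it as 0xC3 0x91. So the file may well not be UTF-8. The sketch below uses two hypothetical helpers, decode_cell and decode_row, that try UTF-8 first and fall back to Latin-1. The list of candidate encodings is an assumption, not something the question confirms.

```python
def decode_cell(raw, encodings=('utf-8', 'latin-1')):
    """Return raw bytes decoded with the first encoding that works.

    Hypothetical helper; the candidate encodings are assumptions.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError('none of %r could decode %r' % (encodings, raw))


def decode_row(row, encodings=('utf-8', 'latin-1')):
    """Decode every cell of a csv row (a list of byte strings)."""
    return [decode_cell(cell, encodings) for cell in row]


# Made-up sample row: one Latin-1 cell, one UTF-8 cell, one plain ASCII cell.
sample_row = [b'PE\xd1A', b'PE\xc3\x91A', b'TX']
decoded = decode_row(sample_row)
```

Latin-1 maps every possible byte to a character, so it never fails and has to come last; it can also silently produce the wrong characters if the real encoding is something else. Finding out how the file was actually written and decoding with that one encoding is the more reliable route.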

answered Oct 16 '22 06:10

ch3ka
