Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to read excel Unicode characters using Python

I am receiving an Excel file whose content I cannot influence. It contains some Unicode characters like "á" or "é".

My code has been unchanged, but I migrated from Eclipse Juno to LiClipse together to a migration to a different python package (2.6 from 2.5). In principle the specific package I am using has a working version on win32com package.

When I read the Excel file my code is crashing when extracting and converting to to strings using str(). The console output is the following:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 89: ordinal not in range(128)

Being more concrete I perform the following:

Read the Excel:

  xlApp = Dispatch("Excel.Application")

  excel = xlApp.Workbooks.Open(excel_location)

in an internal loop I extract the value of the cell:

cell_value = self.excel.ActiveSheet.Cells(excel_line + 1, excel_column + 1)

and finally, if I try to convert cell_value to str, crashes:

print str(cell_value)

If I go to the Excel and remove the non-ASCII characters everything is working smoothly. I have tried this encode proposal. Any other solution I have googled proposes saving the file in a specific format, that I can't do.

What puzzles me is that the code was working before with the same input Excel but this change to LiClipse and 2.6 Python killed everything.

Any idea how can I progress?

like image 304
Trebia Project. Avatar asked Mar 17 '23 08:03

Trebia Project.


2 Answers

This is a common problem when working with UTF-8 encoded Unicode data in Python 2.x. The handling of this has changed in a few places between 2.4 and 2.7, so it's no surprise that you suddenly get an error.

The source of the error is print: In Python 2.x, print doesn't try to assume what encoding your terminal supports. It just plays save and assumes that ascii is the only supported charset (which means characters between 0 and 127 are fine, everything else gives an error).

Now you convert a COMObject to a string. str is just a bunch of bytes (values 0 to 255) as far as Python 2.x is concerned. It doesn't have an encoding.

Combining the two is a recipe for trouble. When Python prints, it tries to validate the input (the string) and suddenly finds UTF-8 encoded characters (UTF-8 adds these odd \xe1 markers which tells the decoder that the next byte is special in some way; check Wikipedia for the gory details).

That's when the ascii encoder says: Sorry, can't help you there.

That means you can work with this value, compare it and such, but you can't print it. A simple fix for the printing problem is:

s = str(cell_value) # Convert COM -> UTF-8 encoded string
print repr(s) # repr() converts anything to ascii

If your terminal supports UTF-8, then you need to tell Python about it:

import sys
import codecs

sys.stdout = codecs.getwriter('utf8')(sys.stdout)

You should also have a look at sys.stdout.encoding which tells what Python currently thinks the output encoding is/should be. When Python 2 is properly configured (like on modern Linux distributions), then the correct codec for output should be used automatically.

Related:

  • Python 2 Unicode howto
  • Pragmatic Unicode, or, How do I stop the pain?
  • Setting the correct encoding when piping stdout in Python
like image 94
Aaron Digulla Avatar answered Mar 18 '23 21:03

Aaron Digulla


.Cells(row,col) returns a Range object. You probably want the text from the cell:

cell = xl.ActiveSheet.Cells(1,2).Text

or

cell = xl.ActiveSheet.Range('B1').Text

The resulting value will be a Unicode string. To convert to bytes that you can write to a file, use .encode(encoding), for example:

bytes = cell.encode('utf8')

The below example uses the following spreadsheet:

enter image description here

import win32com.client
xl = win32com.client.gencache.EnsureDispatch('Excel.Application')
xl.Workbooks.Open(r'book1.xlsx')
cell = xl.ActiveSheet.Cells(1,2)
cell_value = cell.Text
print repr(cell)
print repr(cell_value)
print cell_value

Output (Note, Chinese will only print if console/IDE supports the characters):

<win32com.gen_py.Microsoft Excel 14.0 Object Library.Range instance at 0x129909424>
u'\u4e2d\u56fd\u4eba'
中国人
like image 22
Mark Tolonen Avatar answered Mar 18 '23 22:03

Mark Tolonen