Pandas df.to_csv("file.csv", encoding="utf-8") still gives garbage characters for minus sign

I've read about a Python 2 limitation affecting Pandas' to_csv( ... etc ...). Have I hit it? I'm on Python 2.7.3.

This produces garbage characters for ≥ and − when they appear in strings. Apart from that, the export is perfect.

df.to_csv("file.csv", encoding="utf-8")

Is there any workaround?

df.head() is this:

demography  Adults ≥49 yrs  Adults 18−49 yrs at high risk||  \
state
Alabama               32.7                             38.6
Alaska                31.2                             33.2
Arizona               22.9                             38.8
Arkansas              31.2                             34.0
California            29.8                             38.8

The CSV output is this:

state,  Adults â‰¥49 yrs,   Adults 18âˆ’49 yrs at high risk||
0,  Alabama,    32.7,   38.6
1,  Alaska, 31.2,   33.2
2,  Arizona,    22.9,   38.8
3,  Arkansas,31.2,  34
4,  California,29.8, 38.8

The whole code is this:

import pandas
import xlrd
import csv
import json

df = pandas.DataFrame()
dy = pandas.DataFrame()
# first merge all this xls together

workbook = xlrd.open_workbook('csv_merger/vaccoverage.xls')
worksheets = workbook.sheet_names()

for i in range(3,len(worksheets)):
    dy = pandas.io.excel.read_excel(workbook, i, engine='xlrd', index=None)
    i = i+1
    df = df.append(dy)

df.index.name = "index"

df.columns = ['demography', 'area','state', 'month', 'rate', 'moe']

#Then just grab month = 'May'

may_mask = df['month'] == "May"
may_df = (df[may_mask])

#then delete some columns we dont need

may_df = may_df.drop('area', 1)
may_df = may_df.drop('month', 1)
may_df = may_df.drop('moe', 1)

print may_df.dtypes #uh oh, it sees 'rate' as type 'object', not 'float'.  Better change that.

may_df = may_df.convert_objects('rate', convert_numeric=True)

print may_df.dtypes #that's better

res = may_df.pivot_table('rate', 'state', 'demography')
print res.head()

#and this is going to spit out an array of Objects, each Object a state containing its demographics
res.reset_index().to_json("thejson.json", orient='records')
#and a .csv for good measure
res.reset_index().to_csv("thecsv.csv", orient='records', encoding="utf-8")

925

asked Sep 11 '14 12:09

Maggie

1 Answer

Your "bad" output is UTF-8 displayed as CP1252.
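To see why, here is a minimal sketch (written for Python 3, not the asker's Python 2.7, and using a hypothetical helper named misread_as_cp1252) that encodes the two column headers as UTF-8 and then decodes those bytes with the wrong codec. It reproduces the garbled headers from the question and shows that the mistake is reversible, which suggests the file on disk is fine and only the viewer's guess at the encoding is wrong.

```python
# Column headers from the question, containing U+2265 (≥) and U+2212 (−).
headers = ["Adults ≥49 yrs", "Adults 18−49 yrs at high risk||"]


def misread_as_cp1252(text):
    """Encode text as UTF-8, then decode the bytes as CP1252 (the wrong codec)."""
    return text.encode("utf-8").decode("cp1252")


# What a CP1252-assuming program displays for a correct UTF-8 file.
garbled = [misread_as_cp1252(h) for h in headers]
# garbled[0] == "Adults â‰¥49 yrs"
# garbled[1] == "Adults 18âˆ’49 yrs at high risk||"

# Undoing the wrong decode recovers the original text: no data was lost.
recovered = [g.encode("cp1252").decode("utf-8") for g in garbled]
```

Each of the two symbols takes three bytes in UTF-8, and CP1252 maps every byte to its own character, which is why one symbol turns into three.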

On Windows, many editors assume the default ANSI encoding (CP1252 on US Windows) instead of UTF-8 if there is no byte order mark (BOM) character at the start of the file. While a BOM is meaningless to the UTF-8 encoding, its UTF-8-encoded presence serves as a signature for some programs. For example, Microsoft Office's Excel requires it even on non-Windows OSes. Try:

df.to_csv('file.csv',encoding='utf-8-sig')

That encoder will add the BOM.
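As a rough illustration of what that codec does, the sketch below uses only the Python 3 standard library (the csv module rather than pandas, a temporary file name chosen for the example, and sample rows borrowed from the question) to write a small file with utf-8-sig and inspect its first bytes.

```python
import codecs
import csv
import os
import tempfile

# Sample rows borrowed from the question's data; the file name is arbitrary.
rows = [["state", "Adults ≥49 yrs"], ["Alabama", "32.7"]]
path = os.path.join(tempfile.mkdtemp(), "with_bom.csv")

# "utf-8-sig" writes the three-byte BOM first, then ordinary UTF-8.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)

# Inspect the raw bytes: the file should start with EF BB BF.
with open(path, "rb") as f:
    raw = f.read()
has_bom = raw.startswith(codecs.BOM_UTF8)  # codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# Reading with "utf-8-sig" strips the BOM again, so the data round-trips.
with open(path, encoding="utf-8-sig", newline="") as f:
    round_trip = list(csv.reader(f))
```

Reading the same file with plain utf-8 instead would leave a U+FEFF character at the start of the first cell, so readers of the file should use utf-8-sig as well.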

110

answered Sep 25 '22 00:09

Mark Tolonen

Tags: python, pandas, csv, utf-8
