Pandas read_csv and UTF-16

Tags: pandas, csv, python-2.7, utf-16

I have a CSV text file encoded in UTF-16 (to preserve Unicode characters when others open it in Excel), but when I call read_csv with pandas 0.9.0, I get this cryptic error:

df = pd.read_csv('data.txt',encoding='utf-16',sep='\t',header=0)
df.head()

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-18-85da1383cd9e> in <module>()
----> 1 df = pd.read_csv('candidates-spanish.txt',encoding='utf-16',sep='\t',header=0)
  2 df.head()

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in read_csv(filepath_or_buffer, sep, dialect, header, index_col, names, skiprows, na_values, keep_default_na, thousands, comment, parse_dates, keep_date_col, dayfirst, date_parser, nrows, iterator, chunksize, skip_footer, converters, verbose, delimiter, encoding, squeeze, **kwds)
248         kdict['delimiter'] = sep
249 
--> 250     return _read(TextParser, filepath_or_buffer, kdict)
251 
252 @Appender(_read_table_doc)

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(cls, filepath_or_buffer, kwds)
198         return parser
199 
--> 200     return parser.get_chunk()
201 
202 @Appender(_read_csv_doc)

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in get_chunk(self, rows)
853         elif not self._has_complex_date_col:
854             index = self._get_simple_index(alldata, columns)
--> 855             index = self._agg_index(index)
856 
857         elif self._has_complex_date_col:

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in _agg_index(self, index, try_parse_dates)
980                 arr, _ = _convert_types(arr, col_na_values)
981                 arrays.append(arr)
--> 982             index = MultiIndex.from_arrays(arrays, names=self.index_name)
983         return index
984 

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/index.pyc in from_arrays(cls, arrays, sortorder, names)
1570 
1571         return MultiIndex(levels=levels, labels=labels,
-> 1572                           sortorder=sortorder, names=names)
1573 
1574     @classmethod

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/index.pyc in __new__(cls, levels, labels, sortorder, names)
1254         assert(len(levels) == len(labels))
1255         if len(levels) == 0:
-> 1256             raise Exception('Must pass non-zero number of levels/labels')
1257 
1258         if len(levels) == 1:

Exception: Must pass non-zero number of levels/labels

Reading the data line by line with csv.reader (based on this example) suggests that my data is correctly formatted:

from io import BytesIO
import csv

with open('data.txt','rb') as f:
    r = f.read().decode('utf-16').encode('utf-8')
    for l in csv.reader(BytesIO(r),delimiter='\t'):
        print l

['Country', 'State/City', 'Title', 'Date', 'Catalogue', 'Wikipedia Election Page', 'Wikipedia Individual Page', 'Electoral Institution in Country', 'Twitter', 'CANDIDATE NAME 1', 'CANDIDATE NAME 2']
['Venezuela', 'N/A', 'President', '10/7/12', 'Hugo Rafael Chavez Frias', 'Hugo Ch\xc3\xa1vez', 'Hugo Ch\xc3\xa1vez', 'Hugo Chavez', 'Hugo Ch\xc3\xa1vez Fr\xc3\xadas', 'Hugo Chavez', 'Hugo Ch\xc3\xa1vez']
['Venezuela', 'N/A', 'President', '10/7/12', 'Henrique Capriles Radonski', 'Henrique Capriles Radonski', 'Henrique Capriles Radonski', 'Henrique Capriles Radonski', 'Henrique Capriles R.', 'Henrique Capriles', '']
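For readers on Python 3, roughly the same sanity check can be sketched with the standard library alone. This is a minimal sketch, not the code from the question: the helper name read_utf16_rows and the tiny sample file are hypothetical stand-ins for data.txt, and it assumes the file starts with a byte order mark, as files saved from Excel as "Unicode Text" typically do.

```python
import csv
import os
import tempfile


def read_utf16_rows(path, delimiter="\t"):
    """Return every row of a UTF-16 delimited text file as a list of str."""
    # newline="" leaves line-ending handling to the csv module.
    with open(path, encoding="utf-16", newline="") as f:
        return list(csv.reader(f, delimiter=delimiter))


# Hypothetical sample data written to a temporary UTF-16 file.
sample = "Country\tTitle\tTwitter\nVenezuela\tPresident\tHugo Chávez\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample.encode("utf-16"))  # the "utf-16" codec prepends a BOM
    sample_path = tmp.name

rows = read_utf16_rows(sample_path)
os.remove(sample_path)

for row in rows:
    print(row)
```

In Python 3 the rows come back as str, so accented characters print as themselves rather than as escaped UTF-8 bytes.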

Is there some pre-processing, an additional option in read_csv, or something else that needs to be done before pandas.read_csv can read a UTF-16 file? Thanks!

asked Dec 03 '12 19:12

Brian Keegan

2 Answers

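This is a bug. I think it happens because the csv reader passes back an extra empty line at the beginning. It worked for me on Python 2.7.3 and pandas 0.9.1 when I decoded the file myself and handed read_csv UTF-8 bytes (fh is the open file handle for the data file):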

In [36]: pd.read_csv(BytesIO(fh.read().decode('UTF-16').encode('UTF-8')), sep='\t', header=0)
Out[36]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 0 to 49
Data columns:
Country                             43  non-null values
State/City                          43  non-null values
Title                               43  non-null values
Date                                43  non-null values
Catalogue                           43  non-null values
Wikipedia Election Page             43  non-null values
Wikipedia Individual Page           43  non-null values
Electoral Institution in Country    43  non-null values
Twitter                             43  non-null values
CANDIDATE NAME 1                    43  non-null values
CANDIDATE NAME 2                    16  non-null values
dtypes: object(11)

I reported the bug here: https://github.com/pydata/pandas/issues/2418. On GitHub master it unfortunately causes a segfault in the C parser. We'll fix it.

Now, interestingly: https://softwareengineering.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful ;)
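The transcoding workaround in this answer can be sketched as a small Python 3 function. This is only an illustration under stated assumptions: the function name read_utf16_via_utf8 and the sample bytes are hypothetical, the input is assumed to begin with a byte order mark, and a reasonably recent pandas is assumed to be installed.

```python
import io

import pandas as pd


def read_utf16_via_utf8(raw_bytes, sep="\t"):
    """Decode UTF-16 bytes, re-encode them as UTF-8, and parse with pandas."""
    utf8_bytes = raw_bytes.decode("utf-16").encode("utf-8")
    # pandas now only ever sees UTF-8, which sidesteps its UTF-16 handling.
    return pd.read_csv(io.BytesIO(utf8_bytes), sep=sep, header=0, encoding="utf-8")


# Hypothetical sample standing in for the bytes of data.txt.
raw = "Country\tTitle\tTwitter\nVenezuela\tPresident\tHugo Chávez\n".encode("utf-16")
df_workaround = read_utf16_via_utf8(raw)
print(df_workaround)
```

The whole file is held in memory twice during the conversion, so this suits small files like the one in the question better than very large ones.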

answered Oct 06 '22 12:10

Chang She

In Python 3:

import pandas as pd

with open('data.txt', encoding='UTF-16') as f:
    df = pd.read_csv(f, sep='\t')  # the file in the question is tab-separated
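As a hedged illustration of this approach, the sketch below reads the same UTF-16 file in two ways: by passing the path with encoding='utf-16', and by passing an already opened text-mode handle. It assumes Python 3 and a recent pandas. The sample file and variable names are hypothetical, and sep='\t' is passed because the file in the question is tab-separated.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample written to a temporary UTF-16 file (BOM included).
sample = "Country\tTitle\tTwitter\nVenezuela\tPresident\tHugo Chávez\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample.encode("utf-16"))
    path = tmp.name

# Option 1: let pandas open the file and decode it.
df_from_path = pd.read_csv(path, encoding="utf-16", sep="\t")

# Option 2: decode in Python and hand pandas a text-mode file object.
with open(path, encoding="utf-16", newline="") as f:
    df_from_handle = pd.read_csv(f, sep="\t")

os.remove(path)
print(df_from_path)
```

Both options should yield the same DataFrame. Opening the file yourself mainly helps when you also want to control error handling or newline behaviour in open().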

answered Oct 06 '22 10:10

avances123
