Python CSV write to file unreadable in Excel (Chinese characters)

Tags: python, csv, excel, unicode

I am trying to perform text analysis on Chinese texts. The program is provided below. The result contains unreadable characters such as 浜烘皯鏃ユ姤绀捐. If I change the output file from result.csv to result.txt, the characters are correct: 人民日报社论. What is wrong here? I cannot figure it out. I have tried several approaches, including adding a decoder and an encoder.

    # -*- coding: utf-8 -*-
    import os
    import glob
    import jieba
    import jieba.analyse
    import csv
    import codecs  

    segList = []
    raw_data_path = 'monthly_raw_data/'
    file_name = ["201010", "201011", "201012", "201101", "201103", "201105", "201107", "201109", "201110", "201111", "201112", "201201", "201202", "201203", "201205", "201206", "201208", "201210", "201211"]

    jieba.load_userdict("customized_dict.txt")

    for name in file_name:
        all_text = ""
        multi_line_text = ""
        with open(raw_data_path + name + ".txt", "r") as file:
            for line in file:
                if line != '\n':
                    multi_line_text += line
            templist = multi_line_text.split('\n')
            for text in templist:
                all_text += text
            seg_list = jieba.cut(all_text,cut_all=False)
            temp_text = []
            for item in seg_list:
                temp_text.append(item.encode('utf-8'))

            stop_list = []
            with open("stopwords.txt", "r") as stoplistfile:
                for item in stoplistfile:
                    stop_list.append(item.rstrip('\r\n'))

            text_without_stopwords = []
            for word in temp_text:
                if word not in stop_list:
                    text_without_stopwords.append(word)

            segList.append(text_without_stopwords)


    with open("results/result.csv", 'wb') as f:
        writer = csv.writer(f)
        writer.writerows(segList)

asked Dec 27 '15 15:12

flyingmouse

1 Answer

For UTF-8 encoding, Excel requires a BOM (byte order mark) codepoint written at the start of the file or it will assume an ANSI encoding, which is locale-dependent. U+FEFF is the Unicode BOM. Here's an example that will open in Excel correctly:

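    #!python2
    #coding:utf8
    import csv

    data = [[u'American',u'美国人'],
            [u'Chinese',u'中国人']]

    with open('results.csv','wb') as f:
        # Write the UTF-8 BOM first so Excel detects the encoding.
        f.write(u'\ufeff'.encode('utf8'))
        w = csv.writer(f)
        for row in data:
            # The Python 2 csv module expects byte strings, so encode each cell.
            w.writerow([item.encode('utf8') for item in row])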

Python 3 makes this easier. Open the file with 'w', newline='' and encoding='utf-8-sig' instead of 'wb'. The file then accepts Unicode strings directly and writes the BOM automatically:

    #!python3
    #coding:utf8
    import csv

    data = [['American','美国人'],
            ['Chinese','中国人']]

    # utf-8-sig writes the BOM; newline='' leaves line endings to the csv module.
    with open('results.csv','w',newline='',encoding='utf-8-sig') as f:
        w = csv.writer(f)
        w.writerows(data)
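
As a quick check of the BOM behaviour described above, this sketch writes a small CSV with encoding='utf-8-sig' and inspects the raw bytes. The temporary path and the sample rows are illustrative assumptions; EF BB BF is the UTF-8 encoding of U+FEFF.

```python
import csv
import os
import tempfile

data = [['American', '美国人'],
        ['Chinese', '中国人']]

# Hypothetical scratch location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), 'results.csv')

with open(path, 'w', newline='', encoding='utf-8-sig') as f:
    csv.writer(f).writerows(data)

# The file should begin with the UTF-8 BOM (U+FEFF encoded as EF BB BF).
with open(path, 'rb') as f:
    raw = f.read()
has_bom = raw.startswith(b'\xef\xbb\xbf')

# Reading back with utf-8-sig strips the BOM again.
with open(path, newline='', encoding='utf-8-sig') as f:
    rows = list(csv.reader(f))

print(has_bom, rows == data)
```

Reading the file back with plain 'utf-8' instead would leave the BOM character attached to the first cell, which is why 'utf-8-sig' is used on both sides here.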

There is also a third-party unicodecsv module that makes this easier in Python 2 as well:

    #!python2
    #coding:utf8
    import unicodecsv

    data = [[u'American',u'美国人'],
            [u'Chinese',u'中国人']]

    with open('results.csv','wb') as f:
        w = unicodecsv.writer(f,encoding='utf-8-sig')
        w.writerows(data)
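
The garbled text in the question is consistent with UTF-8 bytes being decoded as GBK. That is an assumption about the asker's machine (cp936 is the usual ANSI code page on Simplified Chinese Windows), not something the question states. A minimal Python 3 sketch of the mismatch, using the first four characters of the asker's sample:

```python
text = '人民日报'

# Encode as UTF-8 (what the script wrote), then decode as GBK
# (what Excel presumably assumed when it found no BOM).
utf8_bytes = text.encode('utf-8')
mojibake = utf8_bytes.decode('gbk')
print(mojibake)

# The damage is reversible here because no bytes were lost.
recovered = mojibake.encode('gbk').decode('utf-8')
print(recovered == text)
```

Only four characters are used so that the UTF-8 byte count (12) is even and pairs up cleanly into two-byte GBK sequences.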

answered Nov 15 '22 14:11

Mark Tolonen
