
Python CSV write to file unreadable in Excel (Chinese characters)

I am trying to perform text analysis on Chinese texts. The program is provided below. When I open the output file result.csv in Excel, I get unreadable characters such as 浜烘皯鏃ユ姤绀捐. If I change the output file from result.csv to result.txt, the characters are correct: 人民日报社论. What is wrong here? I cannot figure it out. I have tried several approaches, including adding a decoder and an encoder.

    # -*- coding: utf-8 -*-
    import os
    import glob
    import jieba
    import jieba.analyse
    import csv
    import codecs  

    segList = []
    raw_data_path = 'monthly_raw_data/'
    file_name = ["201010", "201011", "201012", "201101", "201103", "201105", "201107", "201109", "201110", "201111", "201112", "201201", "201202", "201203", "201205", "201206", "201208", "201210", "201211"]

    jieba.load_userdict("customized_dict.txt")

    for name in file_name:
        all_text = ""
        multi_line_text = ""
        with open(raw_data_path + name + ".txt", "r") as file:
            for line in file:
                if line != '\n':
                    multi_line_text += line
            templist = multi_line_text.split('\n')
            for text in templist:
                all_text += text
            seg_list = jieba.cut(all_text,cut_all=False)
            temp_text = []
            for item in seg_list:
                temp_text.append(item.encode('utf-8'))

            stop_list = []
            with open("stopwords.txt", "r") as stoplistfile:
                for item in stoplistfile:
                    stop_list.append(item.rstrip('\r\n'))

            text_without_stopwords = []
            for word in temp_text:
                if word not in stop_list:
                    text_without_stopwords.append(word)

            segList.append(text_without_stopwords)


    with open("results/result.csv", 'wb') as f:
        writer = csv.writer(f)
        writer.writerows(segList)
flyingmouse asked Dec 27 '15 15:12




1 Answer

For UTF-8 encoding, Excel requires a BOM (byte order mark) codepoint written at the start of the file or it will assume an ANSI encoding, which is locale-dependent. U+FEFF is the Unicode BOM. Here's an example that will open in Excel correctly:

    #!python2
    #coding:utf8
    import csv

    data = [[u'American',u'美国人'],
            [u'Chinese',u'中国人']]

    with open('results.csv','wb') as f:
        f.write(u'\ufeff'.encode('utf8'))
        w = csv.writer(f)
        for row in data:
            w.writerow([item.encode('utf8') for item in row])
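
To see why the missing BOM produces the garbled text in the question, here is a minimal Python 3 sketch. It assumes the "ANSI" code page Excel falls back to is GBK (typical on Simplified Chinese Windows, but an assumption here) and decodes the UTF-8 bytes of 人民 that way. The result should match the first three characters of the garbled string quoted in the question.

```python
# Python 3 sketch: reproduce the garbling by decoding UTF-8 bytes as GBK.
# Assumption: the reader's locale-dependent "ANSI" code page is GBK.
text = "人民"

utf8_bytes = text.encode("utf-8")     # the bytes the script wrote to disk
mojibake = utf8_bytes.decode("gbk")   # what a GBK-assuming reader displays

print(mojibake)

# The bytes on disk are intact: undoing the wrong decode recovers the text.
recovered = mojibake.encode("gbk").decode("utf-8")
print(recovered == text)
```

This is also why the same bytes look fine as result.txt: a text editor that detects or defaults to UTF-8 decodes them correctly, so the file itself is not corrupted.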

Python 3 makes this easier. Open the file with 'w', newline='' and encoding='utf-8-sig' instead of 'wb'. The csv writer then accepts Unicode strings directly, and the BOM is written automatically:

    #!python3
    #coding:utf8
    import csv

    data = [['American','美国人'],
            ['Chinese','中国人']]

    with open('results.csv','w',newline='',encoding='utf-8-sig') as f:
        w = csv.writer(f)
        w.writerows(data)
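
As a quick check of what 'utf-8-sig' does, the following sketch writes the same rows to a temporary file (the temporary path is only for illustration), inspects the raw bytes for the BOM, and reads the file back. Reading with 'utf-8-sig' strips the BOM again, so it does not end up inside the first cell.

```python
import codecs
import csv
import os
import tempfile

rows = [["American", "美国人"], ["Chinese", "中国人"]]
path = os.path.join(tempfile.mkdtemp(), "results.csv")  # illustrative location

# 'utf-8-sig' prepends the UTF-8 encoded BOM (U+FEFF) when writing.
with open(path, "w", newline="", encoding="utf-8-sig") as f:
    csv.writer(f).writerows(rows)

# Inspect the raw bytes: the file should start with EF BB BF.
with open(path, "rb") as f:
    raw = f.read()
has_bom = raw.startswith(codecs.BOM_UTF8)
print(has_bom)

# 'utf-8-sig' on read skips the BOM, so the data round-trips unchanged.
with open(path, newline="", encoding="utf-8-sig") as f:
    round_trip = list(csv.reader(f))
print(round_trip == rows)
```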

There is also a third-party unicodecsv module that makes this easier in Python 2 as well:

    #!python2
    #coding:utf8
    import unicodecsv

    data = [[u'American',u'美国人'],
            [u'Chinese',u'中国人']]

    with open('results.csv','wb') as f:
        w = unicodecsv.writer(f,encoding='utf-8-sig')
        w.writerows(data)
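
Applied to the question's data shape (a list of token lists of varying length), a Python 3 version of the final write step might look like the sketch below. The helper name write_rows_for_excel and the sample tokens are hypothetical stand-ins for the question's segList. It also assumes the tokens are kept as str, so the item.encode('utf-8') step from the question's loop is dropped.

```python
import csv
import os
import tempfile


def write_rows_for_excel(path, rows):
    """Write rows of str tokens as UTF-8 CSV with a leading BOM."""
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        csv.writer(f).writerows(rows)


# Hypothetical stand-in for segList: one list of tokens per input file.
seg_list = [["人民日报", "社论"], ["文本", "分析", "结果"]]

out_path = os.path.join(tempfile.mkdtemp(), "result.csv")  # illustrative path
write_rows_for_excel(out_path, seg_list)
```

Rows of different lengths are fine for csv.writer. Excel simply shows shorter rows with fewer filled cells.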
Mark Tolonen answered Nov 15 '22 14:11