Pandas cannot load data, csv encoding mystery

I am trying to load a dataset into pandas and cannot seem to get past step 1. I am new, so please forgive me if this is obvious; I have searched previous topics and have not found an answer. The data is mostly in Chinese characters, which may be the issue.

The .csv is very large and can be found here: http://weiboscope.jmsc.hku.hk/datazip/ (I am working with week 1.)

The code below shows the three decoding attempts I made, including an attempt to detect which encoding the file uses:

import pandas
import chardet
import os


# This is what I tried to start with
data = pandas.read_csv('week1.csv', encoding="utf-8")

# Spits out error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9a in position 69: invalid start byte

# Code to check the encoding -- this spits out ascii
n_bytes = min(32, os.path.getsize('week1.csv'))  # only the first 32 bytes are sampled
with open('week1.csv', 'rb') as f:
    raw = f.read(n_bytes)
chardet.detect(raw)

# So I tried this! It also fails, which isn't that surprising, since I don't know how you'd represent Chinese characters in ASCII anyway
data = pandas.read_csv('week1.csv', encoding="ascii")

# Spits out error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

# For god knows what reason this lets me load the data into pandas, but it is definitely not the correct encoding: when I print the first 5 lines, I get gibberish instead of Chinese characters
data = pandas.read_csv('week1.csv', encoding="latin1")

Any help would be greatly appreciated!

EDIT: The answer provided by @Kristof does in fact work, as does the program a colleague of mine put together yesterday:

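import pandas as pd

def clean_weiboscope(file, nrows=0):
    """Read the CSV line by line, silently dropping bytes that are not valid UTF-8."""
    headers = None
    res = []
    with open(file, 'r', encoding='utf-8', errors='ignore') as f:
        for i, row in enumerate(f):
            if nrows > 0 and i > nrows:  # nrows=0 means "read everything"
                break
            # Naive split: does not handle quoted fields that contain commas
            fields = row.rstrip('\n').split(',')
            if i == 0:
                headers = fields
            else:
                res.append(tuple(fields))
    df = pd.DataFrame(res)
    # Use the header row as column names only when the widths agree
    if headers is not None and len(headers) == df.shape[1]:
        df.columns = headers
    return df

my_df = clean_weiboscope('week1.csv', nrows=0)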
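
One caveat with splitting each line on commas: a quoted field that itself contains a comma gets broken apart. A small sketch on made-up sample rows (not taken from the Weiboscope file) comparing `str.split` with the standard `csv` module:

```python
import csv
import io

# Hypothetical sample: the second column of the data row contains a comma inside quotes
sample = 'id,text\n1,"hello, world"\n'

# Naive approach: split every line on commas
naive = [line.split(',') for line in sample.splitlines()]

# csv.reader understands the quoting and keeps the field whole
parsed = list(csv.reader(io.StringIO(sample)))

print(naive[1])   # three pieces, with stray quote characters
print(parsed[1])  # two fields
```

Whether this matters for the real file depends on how its text columns are quoted, which I have not verified.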

I also wanted to add for future searchers that this is the Weiboscope open data for 2012.

asked Aug 02 '16 18:08

a mark

1 Answer

It seems that there's something very wrong with the input file. There are encoding errors throughout.
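
To get a feel for how widespread the errors are, one option is to record the offset of every invalid byte sequence in a sample of the file. This is only a sketch: `find_invalid_utf8` and the handler name `record_positions` are made up for illustration, and it decodes the whole input in memory, so it is meant for a sample rather than the full file.

```python
import codecs

def find_invalid_utf8(data):
    """Return the byte offsets at which UTF-8 decoding of `data` fails."""
    positions = []

    def record(err):
        # Remember where the bad sequence starts, emit nothing for it,
        # and resume decoding right after it
        positions.append(err.start)
        return ('', err.end)

    # 'record_positions' is a hypothetical handler name chosen for this sketch
    codecs.register_error('record_positions', record)
    data.decode('utf-8', errors='record_positions')
    return positions

# Made-up sample: valid ASCII, one stray 0x9a byte, then a valid Chinese character
sample = b'abc\x9a' + '中'.encode('utf-8')
print(find_invalid_utf8(sample))
```

For the real file you could pass it the first few megabytes, read with `open('week1.csv', 'rb')`.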

One thing you could do is read the CSV file in binary mode, decode the bytes as UTF-8, and replace the invalid byte sequences.
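
As a small illustration of what the different decoding options do, here is a sketch on a made-up byte string (two valid Chinese characters, one stray byte, then ASCII). It also shows why `latin1` never raises: every byte maps to some character, so the result is gibberish rather than an error.

```python
# Made-up sample bytes, not taken from the real file
raw = '微博'.encode('utf-8') + b'\x9a' + b',ok'

# errors='replace' substitutes U+FFFD (the replacement character) for invalid bytes
replaced = raw.decode('utf-8', errors='replace')

# errors='ignore' silently drops invalid bytes
ignored = raw.decode('utf-8', errors='ignore')

# latin1 maps each byte to one character, so it cannot fail,
# but the multi-byte Chinese characters come out as mojibake
as_latin1 = raw.decode('latin1')

print(repr(replaced))
print(repr(ignored))
print(len(as_latin1), len(raw))
```

With `replace` the damaged spots stay visible in the output, which makes it easier to judge afterwards how much of the data was affected.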

Example (source for the chunk-reading code):

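import codecs
from functools import partial

import pandas as pd

in_filename = 'week1.csv'
out_filename = 'repaired.csv'
chunksize = 100 * 1024 * 1024  # read 100 MB at a time

# Decode as UTF-8 and replace invalid bytes with U+FFFD (the replacement character).
# An incremental decoder keeps a multi-byte character intact when it straddles two chunks.
decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')
with open(in_filename, 'rb') as in_file:
    # Write UTF-8 explicitly (not the platform default) and leave line endings untouched
    with open(out_filename, 'w', encoding='utf-8', newline='') as out_file:
        for byte_fragment in iter(partial(in_file.read, chunksize), b''):
            out_file.write(decoder.decode(byte_fragment))
        out_file.write(decoder.decode(b'', final=True))  # flush any trailing partial character

# Now read the repaired file into a dataframe
df = pd.read_csv(out_filename, encoding='utf-8')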

df.shape
>> (4790108, 11)

df.head()

sample output
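
A note on decoding a file in fixed-size chunks: a multi-byte UTF-8 character can be split across two reads, and decoding each chunk on its own then reports an error where the data is actually fine. A minimal sketch of the difference, using a tiny made-up string and a deliberately small chunk size (both function names are hypothetical):

```python
import codecs

def decode_chunks_naive(data, chunksize):
    """Decode every chunk independently; characters cut by a boundary get replaced."""
    return ''.join(
        data[i:i + chunksize].decode('utf-8', errors='replace')
        for i in range(0, len(data), chunksize)
    )

def decode_chunks_incremental(data, chunksize):
    """Decode with an incremental decoder that carries partial characters over."""
    decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')
    parts = [decoder.decode(data[i:i + chunksize])
             for i in range(0, len(data), chunksize)]
    parts.append(decoder.decode(b'', final=True))  # flush anything still buffered
    return ''.join(parts)

text = '微博数据'             # each character takes 3 bytes in UTF-8
data = text.encode('utf-8')  # 12 bytes in total

naive = decode_chunks_naive(data, 4)              # 4-byte chunks cut through characters
incremental = decode_chunks_incremental(data, 4)

print(naive == text)
print(incremental == text)
```

With 100 MB chunks only a handful of characters per file are at risk, so the effect is small next to genuine encoding errors, but the incremental decoder avoids it at no real cost.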

answered Sep 28 '22 05:09

DocZerø

Tags: python, pandas, chardet
