
python pandas read_csv unable to read character double quoted twice

Tags:

python

pandas

csv

I'm trying to read a two-column CSV file (error.csv) with a semicolon separator, in which some fields contain double-quoted semicolons:

col1;col2
2016-04-17_22:34:25.126;"Linux; Android"
2016-04-17_22:34:25.260;"{"g":2}iPhone; iPhone"

And I'm trying:

import pandas as pd

# quoting=0 is csv.QUOTE_MINIMAL, the default
logs = pd.read_csv('error.csv', na_values="null", sep=';',
                   quotechar='"', quoting=0)

I understand that the problem comes from the double-quoted "g" inside the double quotes on line 3, but I can't figure out how to deal with it. Any ideas?

Gilles Cuyaubere asked Nov 09 '22 16:11



1 Answer

You will probably need to pre-process the data so that it conforms to the expected CSV format. I doubt pandas will handle this just by changing a parameter or two.
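To illustrate what "conforming" means here: a standard CSV parser expects a quote inside a quoted field to be doubled (""). The following is only a minimal sketch of such a pre-processing step, not a general fix. It assumes the quoted field is always the last one on the line, that the first field never contains the separator, and that no inner quotes are already doubled. The helper name repair_line is hypothetical, and the sample data is inlined from the question so the snippet runs on its own.

```python
import csv
import io


def repair_line(line, sep=';'):
    """Double the quotes inside the last field so a CSV parser accepts it.

    Hypothetical helper: assumes the first field never contains `sep`
    and that inner quotes are not already doubled.
    """
    first, found, rest = line.rstrip('\r\n').partition(sep)
    if not found:
        # no separator on this line, leave it untouched
        return first
    # only touch fields that are wrapped in double quotes
    if len(rest) >= 2 and rest.startswith('"') and rest.endswith('"'):
        inner = rest[1:-1].replace('"', '""')
        rest = '"' + inner + '"'
    return first + sep + rest


# sample data copied from the question
raw = (
    'col1;col2\n'
    '2016-04-17_22:34:25.126;"Linux; Android"\n'
    '2016-04-17_22:34:25.260;"{"g":2}iPhone; iPhone"\n'
)

repaired = '\n'.join(repair_line(line) for line in raw.splitlines())

# the standard library parser now reads every row as two fields
rows = list(csv.reader(io.StringIO(repaired), delimiter=';', quotechar='"'))
```

The repaired text should also be readable by pandas, for example by passing io.StringIO(repaired) to pd.read_csv with sep=';', since doubled quotes are what its default settings expect.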

If there are only two columns, and the first never contains a semi-colon, then you could split the lines on the first semi-colon:

import pandas

records = []
with open('error.csv', 'r') as fh:
    # first row is a header
    header = next(fh).strip().split(';')

    for rec in fh:
        # skip blank lines, which cannot be split into two fields
        if not rec.strip():
            continue
        # split only on the first semi-colon
        date, dat = rec.strip().split(';', maxsplit=1)
        # assemble records, removing quotes from the second column
        records.append((date, dat.strip('"')))

# create a data frame
df = pandas.DataFrame.from_records(records, columns=header)

The first column will contain plain strings. If you want proper dates there, you will have to parse them yourself, for example with the datetime module.
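As a rough sketch of that parsing step, the timestamps in the question appear to follow the pattern year-month-day, an underscore, then a time with fractional seconds. The format string below is inferred from the two sample rows only, so treat it as an assumption about the rest of the file. The helper name parse_timestamp is hypothetical.

```python
from datetime import datetime

# format inferred from the sample rows, e.g. 2016-04-17_22:34:25.126
TIMESTAMP_FORMAT = '%Y-%m-%d_%H:%M:%S.%f'


def parse_timestamp(text):
    """Turn one timestamp string from the first column into a datetime."""
    return datetime.strptime(text, TIMESTAMP_FORMAT)


# sample values copied from the question
stamps = ['2016-04-17_22:34:25.126', '2016-04-17_22:34:25.260']
parsed = [parse_timestamp(s) for s in stamps]
```

The same format string can be given to pandas.to_datetime through its format argument if you would rather convert the whole column at once.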

ChrisP answered Nov 14 '22 21:11
