Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing CSV file in pandas with commas in last column

Tags:

python

pandas

I'm stuck with some poorly formatted CSV data that I need to read into a Pandas dataframe. I cannot change how the data is being recorded (it's coming from someplace else), so please no solutions suggesting that.

Most of the data is fine, but some rows have commas in the last column. A simplified example:

column1 is fine,column 2 is fine,column3, however, has commas in it!

All rows should have the same number of columns (3), but this example of course breaks the CSV reader because the commas suggest there are 5 columns when in fact there are 3.

Notice that there is no quoting that would allow me to use the standard CSV reader tools to handle this problem.

What I do know, however, is that the extra comma(s) always occur in the last (rightmost) column. This means I can use a solution that boils down to:

"Always assume there are 3 columns, counting from the left, and interpret all extra commas as string content within column 3". Or, worded differently, "Interpret the first two commas as column separators, but assume any subsequent commas are just part of the string in column 3."

I can think of plenty of kludgy ways to accomplish this, but my question is: Is there any elegant, concise way of addressing this, preferably within my call to pandas.csv_reader(...)?

like image 466
moustachio Avatar asked Sep 30 '22 14:09

moustachio


1 Answers

Fix the csv, then proceed normally:

import csv
with open('path/to/broken.csv', 'rb') as f, open('path/to/fixed.csv', 'wb') as g:
    writer = csv.writer(g, delimiter=',')
    for line in f:
        row = line.split(',', 2)
        writer.writerow(row)

import pandas as pd
df = pd.read_csv('path/to/fixed.csv')
like image 90
unutbu Avatar answered Oct 03 '22 06:10

unutbu