I'm stuck with some poorly formatted CSV data that I need to read into a Pandas dataframe. I cannot change how the data is being recorded (it's coming from someplace else), so please no solutions suggesting that.
Most of the data is fine, but some rows have commas in the last column. A simplified example:
column1 is fine,column 2 is fine,column3, however, has commas in it!
All rows should have the same number of columns (3), but this example of course breaks the CSV reader because the commas suggest there are 5 columns when in fact there are 3.
Notice that there is no quoting that would allow me to use the standard CSV reader tools to handle this problem.
What I do know, however, is that the extra comma(s) always occur in the last (rightmost) column. This means I can use a solution that boils down to:
"Always assume there are 3 columns, counting from the left, and interpret all extra commas as string content within column 3". Or, worded differently, "Interpret the first two commas as column separators, but assume any subsequent commas are just part of the string in column 3."
I can think of plenty of kludgy ways to accomplish this, but my question is: Is there any elegant, concise way of addressing this, preferably within my call to pandas.csv_reader(...)
?
Fix the csv, then proceed normally:
import csv
with open('path/to/broken.csv', 'rb') as f, open('path/to/fixed.csv', 'wb') as g:
writer = csv.writer(g, delimiter=',')
for line in f:
row = line.split(',', 2)
writer.writerow(row)
import pandas as pd
df = pd.read_csv('path/to/fixed.csv')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With