Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas read_csv add header names in case of changing number of columns

Tags:

python

pandas

I have a lot of csv files that I would like to read with Pandas (pd.read_csv), however, in some of the files there is added a column midway that does not have a header, like this example:

Apples, Pears
1, 2
3, 4
5, 6, 7

If using pd.read_csv(example_file) the following error is thrown "ParserError: Error tokenizing data. C error: Expected 2 fields in line 4, saw 3"

I would like to avoid having to skip the line and instead just add a dummy header name, like Unknown1, and get the following result:

Apples, Pears, Unknown1  
1, 2, np.nan
3, 4, np.nan
5, 6, 7
like image 736
Adam R. Jensen Avatar asked Jun 25 '18 16:06

Adam R. Jensen


2 Answers

pandas needs to know the geometry in advance to build the dataframe. You can read the header line and add several dummy column names to supply the number of columns, then re-read the whole csv and discard the columns that weren't used after all.

>>> import pandas as pd
>>> names = list(pd.read_csv('foo.csv', nrows=0)) + ['unknown1', 'unknown2']
>>> df=pd.read_csv('foo.csv', names=names, skiprows=1).dropna(axis='columns', how='all')
>>> df
   Apples   Pears  unknown1
0       1       2       NaN
1       3       4       NaN
2       5       6       7.0

If there are many extra columns and you are worried about the memory footprint of the intermediate dataframe, you can use the csv module to scan the file and calculate the maximum number of rows. Unlike pandas, csv is quite happy to emit varying sized rows.

>>> with open('foo.csv', newline='') as in_fp:
...     reader = csv.reader(in_fp)
...     header = next(reader)
...     num_cols = max(len(row) for row in reader)
... 
>>> names = header + ['unknown{}'.format(i+1) for i in range(num_cols-len(header))]
>>> df = pd.read_csv('foo.csv', names=names, skiprows=1)
>>> df
   Apples   Pears  unknown1
0       1       2       NaN
1       3       4       NaN
2       5       6       7.0
like image 192
tdelaney Avatar answered Oct 19 '22 23:10

tdelaney


We can load the csv then fixed your out after that

import io
t="""Apples, Pears
1, 2
3, 4
5, 6, 7"""
df = pd.read_csv(io.StringIO(t), sep='\t')

yourdf=df.iloc[:,0].str.split(', ',expand=True)
s=df.columns.str.split(', ').tolist()[0]
yourdf.columns=s+['unknow'+str(x+1) for x in range(yourdf.shape[1]-len(s))]


yourdf
Out[104]: 
  Apples Pears unknow1
0      1     2    None
1      3     4    None
2      5     6       7
like image 32
BENY Avatar answered Oct 20 '22 00:10

BENY