Read CSV into a dataFrame with varying row lengths using Pandas

So I have a CSV that looks a bit like this:

1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454
...

When I try to generate a DataFrame with the following code:

df = pd.read_csv('data.csv', header=0, engine='c', error_bad_lines=False)

it only adds rows with 3 columns to the df (rows 1, 3 and 5 from above).

The rest are considered 'bad lines', giving me the following error:

Skipping line 17467: expected 3 fields, saw 9

How do I create a data frame that includes all data in my csv, possibly just filling in the empty cells with null? Or do I have to declare the max row length prior to adding to the df?

Thanks!

241

asked Mar 12 '19 19:03

caaax

3 Answers

If you know that the data contains N columns, you can tell Pandas in advance how many columns to expect via the names parameter:

import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(7)))
print(df)

yields

   0             1    2      3      4      5      6
0  1   01-01-2019   724    NaN    NaN    NaN    NaN
1  2   01-01-2019   233  436.0    NaN    NaN    NaN
2  3   01-01-2019   345    NaN    NaN    NaN    NaN
3  4   01-01-2019   803  933.0  943.0  923.0  954.0
4  5   01-01-2019   454    NaN    NaN    NaN    NaN

If you only know an upper limit, N, on the number of columns, you can have Pandas read N columns and then use dropna to drop the columns that are completely empty:

import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(20))).dropna(axis='columns', how='all')
print(df)

yields

   0             1    2      3      4      5      6
0  1   01-01-2019   724    NaN    NaN    NaN    NaN
1  2   01-01-2019   233  436.0    NaN    NaN    NaN
2  3   01-01-2019   345    NaN    NaN    NaN    NaN
3  4   01-01-2019   803  933.0  943.0  923.0  954.0
4  5   01-01-2019   454    NaN    NaN    NaN    NaN

Note that this could drop columns from the middle of the data set (not just columns from the right-hand side) if they are completely empty.
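
If no upper limit is known in advance, one option is to scan the file once to find the widest row and pass that count to names. The following is only a sketch: read_ragged is a hypothetical helper name, and counting separators assumes that no field contains a quoted '|'.

```python
from io import StringIO

import pandas as pd


def read_ragged(handle, sep="|"):
    """Read delimited text whose rows have different numbers of fields."""
    # First pass: count fields on each non-empty line.
    # Assumption: the separator never appears inside a quoted field.
    n_cols = max(line.count(sep) + 1 for line in handle if line.strip())
    handle.seek(0)  # rewind so pandas parses from the start
    # Second pass: one column name per field of the widest row.
    return pd.read_csv(handle, sep=sep, names=list(range(n_cols)))


# The sample rows from the question, held in memory for illustration.
sample = StringIO(
    "1 | 01-01-2019 | 724\n"
    "2 | 01-01-2019 | 233 | 436\n"
    "3 | 01-01-2019 | 345\n"
    "4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954\n"
    "5 | 01-01-2019 | 454\n"
)
df = read_ragged(sample)
print(df.shape)
```

For a file on disk, open it in text mode and pass the handle: with open('data.csv') as f: df = read_ragged(f). Because every column up to the widest row is kept, this avoids dropping empty columns from the middle of the data.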

183

answered Nov 08 '22 09:11

unutbu

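Read each line in as a single string, then deal with the separator afterwards.

import pandas as pd

# Newer pandas versions reject sep='\n', so read the raw lines with plain Python
with open('data.csv') as f:
    lines = pd.Series(f.read().splitlines())
# Split each line on ' | '; shorter rows are padded with None
df = lines.str.split(r'\s\|\s', expand=True)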

   0           1    2     3     4     5     6
0  1  01-01-2019  724  None  None  None  None
1  2  01-01-2019  233   436  None  None  None
2  3  01-01-2019  345  None  None  None  None
3  4  01-01-2019  803   933   943   923   954
4  5  01-01-2019  454  None  None  None  None
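
Note that the split leaves every cell as a string (or None), as the output above shows. A minimal sketch of a follow-up conversion, assuming that column 0 and columns 2 onward hold numbers as in the sample data:

```python
import pandas as pd

# The sample rows from the question, one string per line.
lines = pd.Series([
    "1 | 01-01-2019 | 724",
    "2 | 01-01-2019 | 233 | 436",
    "3 | 01-01-2019 | 345",
    "4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954",
    "5 | 01-01-2019 | 454",
])
df = lines.str.split(r"\s*\|\s*", expand=True)

# Convert the id column and every value column; None becomes NaN.
# Assumption: only column 1 (the date) is non-numeric.
df[0] = pd.to_numeric(df[0])
for col in df.columns[2:]:
    df[col] = pd.to_numeric(df[col])

print(df.dtypes)
```

Column 1 is left as text here; if the dates are needed, pd.to_datetime with an explicit format can be applied to it in the same way.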

answered Nov 08 '22 09:11

ALollz

pd.read_fwf (read fixed-width) should work:

import pandas as pd
from io import StringIO

s = '''1  01-01-2019  724
2  01-01-2019  233  436
3  01-01-2019  345
4  01-01-2019  803  933  943  923  954
5  01-01-2019  454'''


pd.read_fwf(StringIO(s), header=None)

   0           1    2      3      4      5      6
0  1  01-01-2019  724    NaN    NaN    NaN    NaN
1  2  01-01-2019  233  436.0    NaN    NaN    NaN
2  3  01-01-2019  345    NaN    NaN    NaN    NaN
3  4  01-01-2019  803  933.0  943.0  923.0  954.0
4  5  01-01-2019  454    NaN    NaN    NaN    NaN

Or with the delimiter parameter:

s = '''1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454'''


pd.read_fwf(StringIO(s), header=None, delimiter='|')

   0             1    2      3      4      5      6
0  1   01-01-2019   724    NaN    NaN    NaN    NaN
1  2   01-01-2019   233  436.0    NaN    NaN    NaN
2  3   01-01-2019   345    NaN    NaN    NaN    NaN
3  4   01-01-2019   803  933.0  943.0  923.0  954.0
4  5   01-01-2019   454    NaN    NaN    NaN    NaN

Note that for your actual file you don't need StringIO; just pass the file path instead: pd.read_fwf('data.csv', delimiter='|', header=None)
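
One caveat with this approach: read_fwf infers the column boundaries from only the first rows of the file (the infer_nrows parameter, 100 by default), so a longer row further down may be cut off. The sketch below uses made-up, column-aligned sample data to illustrate the effect and how raising infer_nrows addresses it.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: 150 short rows, then one long row at the very end.
rows = [f"{i:<3}  01-01-2019  724" for i in range(1, 151)]
rows.append("151  01-01-2019  803  933  943  923  954")
text = "\n".join(rows)

# Default: boundaries come from the first 100 rows, which all have 3 fields,
# so the extra fields of the last row are ignored.
narrow = pd.read_fwf(StringIO(text), header=None)

# Inspect every row when inferring boundaries instead.
wide = pd.read_fwf(StringIO(text), header=None, infer_nrows=len(rows))

print(narrow.shape, wide.shape)
```

For a large file whose length is not known, a generous fixed value for infer_nrows is a reasonable alternative, at the cost of a slower inference step.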

answered Nov 08 '22 07:11

It_is_Chris

Tags: python, pandas, dataframe, csv