
Read CSV into a dataFrame with varying row lengths using Pandas

So I have a CSV that looks a bit like this:

1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454
...

And when I try to use the following code to generate a DataFrame...

df = pd.read_csv('data.csv', header=0, engine='c', error_bad_lines=False)

It only adds rows with 3 columns to the df (rows 1, 3 and 5 from above)

The rest are considered 'bad lines' giving me the following error:

Skipping line 17467: expected 3 fields, saw 9

How do I create a data frame that includes all data in my csv, possibly just filling in the empty cells with null? Or do I have to declare the max row length prior to adding to the df?

Thanks!

asked Mar 12 '19 by caaax
People also ask

How to read the CSV file in pandas Dataframe?

To read a CSV file as a pandas.DataFrame, use the pandas function read_csv() or read_table(). The difference between read_csv() and read_table() is minimal; under the hood they call the same code. read_csv() defaults to a comma delimiter, while read_table() defaults to a tab (\t) delimiter.
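For reference, a minimal sketch (using inline strings rather than real files) showing that the two readers produce the same result and differ only in their default delimiter:

import pandas as pd
from io import StringIO

# read_csv() splits on commas; read_table() splits on tabs
csv_data = StringIO("a,b\n1,2\n3,4")
tab_data = StringIO("a\tb\n1\t2\n3\t4")

df_csv = pd.read_csv(csv_data)
df_tab = pd.read_table(tab_data)
print(df_csv.equals(df_tab))  # True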

How to create a pandas Dataframe with only some variables?

It is also possible to create a pandas DataFrame that contains only some of the variables from a CSV file. To do this, assign a list of column names to the usecols argument of read_csv(), as in the sketch below.
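A minimal sketch of usecols, assuming hypothetical column names "id", "date" and "value":

import pandas as pd
from io import StringIO

# keep only the "id" and "value" columns from the file
data = StringIO("id,date,value\n1,01-01-2019,724\n2,01-01-2019,233")
df = pd.read_csv(data, usecols=["id", "value"])
print(df)
#    id  value
# 0   1    724
# 1   2    233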

What is the difference between read_CSV and read_table in pandas?

read_csv() uses a comma as its delimiter; read_table() uses a tab (\t). Both read values into a DataFrame. You can export data to a CSV file from any modern office suite, including Google Sheets, or from pandas itself with to_csv().
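As a small illustration (the file name out.csv is just an example), exporting a DataFrame back to CSV from pandas looks like this:

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": [724, 233]})
# index=False keeps the row index out of the output file
df.to_csv("out.csv", index=False)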

How to store big data sets in pandas?

A simple way to store big data sets is to use CSV (comma-separated values) files. CSV files contain plain text in a well-known format that can be read by many tools, including pandas. In these examples the file is called 'data.csv'. Tip: use to_string() to print the entire DataFrame.
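For example (assuming data.csv is present in the working directory):

import pandas as pd

df = pd.read_csv("data.csv")
# to_string() renders every row instead of the truncated default repr
print(df.to_string())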


3 Answers

If you know that the data contains N columns, you can tell Pandas in advance how many columns to expect via the names parameter:

import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(7)))
print(df)

yields

   0             1    2      3      4      5      6
0  1   01-01-2019   724    NaN    NaN    NaN    NaN
1  2   01-01-2019   233  436.0    NaN    NaN    NaN
2  3   01-01-2019   345    NaN    NaN    NaN    NaN
3  4   01-01-2019   803  933.0  943.0  923.0  954.0
4  5   01-01-2019   454    NaN    NaN    NaN    NaN

If you have an upper limit, N, on the number of columns, then you can have Pandas read N columns and then use dropna to drop completely empty columns:

import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(20))).dropna(axis='columns', how='all')
print(df)

yields

   0             1    2      3      4      5      6
0  1   01-01-2019   724    NaN    NaN    NaN    NaN
1  2   01-01-2019   233  436.0    NaN    NaN    NaN
2  3   01-01-2019   345    NaN    NaN    NaN    NaN
3  4   01-01-2019   803  933.0  943.0  923.0  954.0
4  5   01-01-2019   454    NaN    NaN    NaN    NaN

Note that this could drop columns from the middle of the data set (not just columns from the right-hand side) if they are completely empty.
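A small illustration of that caveat, using a hypothetical frame whose middle column is empty on every row:

import pandas as pd

# column 1 is all-NaN, so how='all' drops it even though it is not the
# right-most column
df = pd.DataFrame({0: [1, 2], 1: [None, None], 2: [3, 4]})
print(df.dropna(axis='columns', how='all'))
#    0  2
# 0  1  3
# 1  2  4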

answered by unutbu

If using only pandas, read the lines in whole and deal with the separator afterwards.

import pandas as pd

# read each line as a single column, then split on " | "
df = pd.read_csv('data.csv', header=None, sep='\n')
df = df[0].str.split(r'\s\|\s', expand=True)

   0           1    2     3     4     5     6
0  1  01-01-2019  724  None  None  None  None
1  2  01-01-2019  233   436  None  None  None
2  3  01-01-2019  345  None  None  None  None
3  4  01-01-2019  803   933   943   923   954
4  5  01-01-2019  454  None  None  None  None
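Not part of the original answer, but note that newer pandas versions may reject sep='\n'; a sketch of the same read-then-split idea using plain Python to read the lines (assuming the file is data.csv):

import re
import pandas as pd

# split each line on the pipe separator; pd.DataFrame pads shorter rows
# with missing values automatically
with open('data.csv') as fh:
    rows = [re.split(r'\s*\|\s*', line.strip()) for line in fh if line.strip()]

df = pd.DataFrame(rows)
print(df)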
answered by ALollz

Reading as fixed-width with read_fwf should work:

import pandas as pd
from io import StringIO

s = '''1  01-01-2019  724
2  01-01-2019  233  436
3  01-01-2019  345
4  01-01-2019  803  933  943  923  954
5  01-01-2019  454'''


pd.read_fwf(StringIO(s), header=None)

   0           1    2      3      4      5      6
0  1  01-01-2019  724    NaN    NaN    NaN    NaN
1  2  01-01-2019  233  436.0    NaN    NaN    NaN
2  3  01-01-2019  345    NaN    NaN    NaN    NaN
3  4  01-01-2019  803  933.0  943.0  923.0  954.0
4  5  01-01-2019  454    NaN    NaN    NaN    NaN

or with a delimiter parameter:

s = '''1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454'''


pd.read_fwf(StringIO(s), header=None, delimiter='|')

   0             1    2      3      4      5      6
0  1   01-01-2019   724    NaN    NaN    NaN    NaN
1  2   01-01-2019   233  436.0    NaN    NaN    NaN
2  3   01-01-2019   345    NaN    NaN    NaN    NaN
3  4   01-01-2019   803  933.0  943.0  923.0  954.0
4  5   01-01-2019   454    NaN    NaN    NaN    NaN

Note that for your actual file you would not use StringIO; just replace it with your file path: pd.read_fwf('data.csv', delimiter='|', header=None)

answered by It_is_Chris