So I have a CSV that looks a bit like this:
1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454
...
And when I try to use the following code to generate a dataFrame..
df = pd.read_csv('data.csv', header=0, engine='c', error_bad_lines=False)
It only adds rows with 3 columns to the df (rows 1, 3 and 5 from above)
The rest are considered 'bad lines' giving me the following error:
Skipping line 17467: expected 3 fields, saw 9
How do I create a data frame that includes all data in my csv, possibly just filling in the empty cells with null? Or do I have to declare the max row length prior to adding to the df?
Thanks!
To read the csv file as pandas.DataFrame, use the pandas function read_csv () or read_table (). The difference between read_csv () and read_table () is almost nothing. In fact, the same function is called by the source: read_csv () delimiter is a comma character. read_table () is a delimiter of tab t.
It is also possible to create a pandas DataFrame that contains only some of the variables from a CSV file. The following Python programming code explains how to do that based on our example file. For this, we have to assign a list of column names to the usecols argument within the read_csv function.
read_csv() delimiter is a comma character; read_table() is a delimiter of tab \t. Related course: Data Analysis with Python Pandas. Read CSV Read csv with Python. The pandas function read_csv() reads in values, where the delimiter is a comma character. You can export a file into a csv file in any modern office suite including Google Sheets.
A simple way to store big data sets is to use CSV files (comma separated files). CSV files contains plain text and is a well know format that can be read by everyone including Pandas. In our examples we will be using a CSV file called 'data.csv'. Download data.csv. or Open data.csv Tip: use to_string () to print the entire DataFrame.
If you know that the data contains N
columns, you can
tell Pandas in advance how many columns to expect via the names
parameter:
import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(7)))
print(df)
yields
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
If you have an the upper limit, N
, on the number of columns, then you can
have Pandas read N
columns and then use dropna
to drop completely empty columns:
import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(20))).dropna(axis='columns', how='all')
print(df)
yields
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
Note that this could drop columns from the middle of the data set (not just columns from the right-hand side) if they are completely empty.
If using only pandas
, read in lines, deal with the separator after.
import pandas as pd
df = pd.read_csv('data.csv', header=None, sep='\n')
df = df[0].str.split('\s\|\s', expand=True)
0 1 2 3 4 5 6
0 1 01-01-2019 724 None None None None
1 2 01-01-2019 233 436 None None None
2 3 01-01-2019 345 None None None None
3 4 01-01-2019 803 933 943 923 954
4 5 01-01-2019 454 None None None None
Read fixed width should work:
from io import StringIO
s = '''1 01-01-2019 724
2 01-01-2019 233 436
3 01-01-2019 345
4 01-01-2019 803 933 943 923 954
5 01-01-2019 454'''
pd.read_fwf(StringIO(s), header=None)
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
or with a delimiter
param
s = '''1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454'''
pd.read_fwf(StringIO(s), header=None, delimiter='|')
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
note that for your actual file you will not use StringIO
you would just replace that with your file path: pd.read_fwf('data.csv', delimiter='|', header=None)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With