I have a data set that looks like this (at most 5 columns, but there can be fewer):

    1,2,3
    1,2,3,4
    1,2,3,4,5
    1,2
    1,2,3,4
    ....
I am trying to use pandas read_table to read this into a 5 column data frame. I would like to read this in without additional massaging.
If I try

    import pandas as pd
    my_cols = ['A', 'B', 'C', 'D', 'E']
    my_df = pd.read_table(path, sep=',', header=None, names=my_cols)

I get an error - "column names have 5 fields, data has 3 fields".
Is there any way to make pandas fill in NaN for the missing columns while reading the data?
One way which seems to work (at least in 0.10.1 and 0.11.0.dev-fc8de6d):

    >>> !cat ragged.csv
    1,2,3
    1,2,3,4
    1,2,3,4,5
    1,2
    1,2,3,4
    >>> my_cols = ["A", "B", "C", "D", "E"]
    >>> pd.read_csv("ragged.csv", names=my_cols, engine='python')
       A  B   C   D   E
    0  1  2   3 NaN NaN
    1  1  2   3   4 NaN
    2  1  2   3   4   5
    3  1  2 NaN NaN NaN
    4  1  2   3   4 NaN

Note that this approach requires that you give names to the columns you want, though. Not as general as some other ways, but works well enough when it applies.
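If the maximum number of columns is not known in advance, one possible workaround is to scan the file once to find the widest row and then pass that many placeholder names. The following is only a sketch under assumptions: the in-memory string `raw` stands in for the real file, and the integer column labels are placeholders I picked for illustration, not something from the question.

```python
import csv
import io

import pandas as pd

# Hypothetical sample data standing in for the ragged file from the question.
raw = "1,2,3\n1,2,3,4\n1,2,3,4,5\n1,2\n1,2,3,4\n"

# First pass: count the fields in every row and keep the largest count.
max_fields = max(len(row) for row in csv.reader(io.StringIO(raw)))

# Second pass: supply that many placeholder column names (0, 1, 2, ...).
# Rows with fewer fields are padded with NaN, as in the answer above.
df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=list(range(max_fields)),
    engine="python",
)

print(df)
```

This reads the data twice, which is probably fine for small files but may be worth reconsidering for very large ones.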