
How to make a pandas DataFrame Fortran-ordered

I know that a pandas DataFrame is built on NumPy ndarrays, and that NumPy lets you choose the memory order of an array: 'C' (row-major) or 'F' (column-major, Fortran-style).

I run a lot of column-wise operations on huge DataFrames (around 100 million rows), so I expect that converting a DataFrame from C order to F order would improve performance considerably. Is that right?

If so, how can I do that? A NumPy-only solution is also fine, since a pandas DataFrame is not a requirement; a quick answer is what I'm after.

Thanks

cinqS asked Sep 11 '25 00:09



1 Answer

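Interestingly, pandas internally stores each column as a contiguous NumPy array. A single column is one-dimensional, so it reports as both C- and F-contiguous. When you access multiple columns or the whole DataFrame, pandas combines those arrays and, in the output below, returns a Fortran-ordered NumPy array.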

print(df[df.columns[0]].values.flags)
print(df[df.columns[0:2]].values.flags)
print(df.values.flags)

#Single column
C_CONTIGUOUS : True
F_CONTIGUOUS : True

#Multiple columns
C_CONTIGUOUS : False
F_CONTIGUOUS : True

#Entire dataframe
C_CONTIGUOUS : False
F_CONTIGUOUS : True
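
The same flags can be reproduced with plain NumPy, without pandas. The following is a minimal sketch on a small made-up array (the names `a_c`, `a_f`, and `col` are illustrative only); it shows why a one-dimensional column taken from a Fortran-ordered array reports both flags as True, while a column taken from a C-ordered array is a strided view and reports neither.

```python
import numpy as np

# A small 2-D array; NumPy creates arrays in C (row-major) order by default.
a_c = np.arange(12, dtype=np.float64).reshape(3, 4)

# Column-major copy of the same values.
a_f = np.asfortranarray(a_c)

# One column from each layout.
col_from_f = a_f[:, 0]   # elements are adjacent in memory
col_from_c = a_c[:, 0]   # elements are one full row apart in memory

for name, arr in [("a_c", a_c), ("a_f", a_f),
                  ("col_from_f", col_from_f), ("col_from_c", col_from_c)]:
    print(name, arr.flags["C_CONTIGUOUS"], arr.flags["F_CONTIGUOUS"], arr.strides)
```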

So column operations (adding, editing, or deleting a column, and so on) are very fast, because each column sits contiguously in memory. For the same reason, iterating over rows of a DataFrame is slow. If your program does mostly row operations, convert the data to C order as below.

df = pd.DataFrame(np.ascontiguousarray(df.values), index=df.index, columns=df.columns)
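
Since the question says plain NumPy would also do, here is a rough sketch of the opposite conversion with `np.asfortranarray`, plus a way to measure whether it helps for a column-wise operation. The array shape, the helper `scale_column`, and the repeat count are arbitrary choices for illustration, and the timings depend on the machine and the operation, so measure on your own data rather than assuming a speedup.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
data_c = rng.random((200_000, 8))      # C order (row-major) by default
data_f = np.asfortranarray(data_c)     # column-major copy of the same values

def scale_column(arr, j=3):
    """Hypothetical column-wise operation: scale one column and sum it."""
    return (arr[:, j] * 2.0).sum()

# Time the same column operation on both layouts.
t_c = timeit.timeit(lambda: scale_column(data_c), number=20)
t_f = timeit.timeit(lambda: scale_column(data_f), number=20)

# Both layouts hold the same values, so the results agree up to rounding.
same_result = bool(np.isclose(scale_column(data_c), scale_column(data_f)))

print(f"C order: {t_c:.4f} s, F order: {t_f:.4f} s, same result: {same_result}")
```

Note that `np.asfortranarray` makes a full copy when the input is not already Fortran-ordered, so for an array with 100 million rows the conversion itself needs roughly as much additional memory as the original.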

Once I am done with column-wise processing, I convert the data to a C-contiguous array, because scaling and batch training of a DNN are much faster on a C-ordered array.
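
As an illustration of that last step, here is a small sketch with a made-up three-column DataFrame and an arbitrary batch size of 4. `np.ascontiguousarray` returns a C-ordered array whatever layout pandas hands back, so each row batch sliced from it is itself one contiguous block of memory.

```python
import numpy as np
import pandas as pd

# Made-up example data; column names and sizes are illustrative only.
df = pd.DataFrame({
    "a": np.arange(10.0),
    "b": np.arange(10.0) * 2,
    "c": np.arange(10.0) * 3,
})

# Convert once, after the column-wise work is finished.
X = np.ascontiguousarray(df.to_numpy())

# Slice row batches; each batch is a contiguous view into X.
batch_size = 4
batches = [X[i:i + batch_size] for i in range(0, len(X), batch_size)]

for b in batches:
    print(b.shape, b.flags["C_CONTIGUOUS"])
```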

lxkarthi answered Sep 12 '25 14:09
