I know that inside the Python pandas package, a DataFrame is partly built on NumPy ndarrays, and NumPy lets you choose the memory order of an array, 'C' or 'F'.
Since I often run many column operations on huge DataFrames (around 100 million rows), I expected that converting a DataFrame from C order to F order could improve performance a lot. Is that right?
If so, how can I do that? An answer using plain NumPy is also fine, since a pandas DataFrame is not a must; a quick answer would be appreciated.
Thanks
Interestingly, pandas internally stores each column as a C-ordered NumPy array. Whenever you access multiple columns, or the whole DataFrame, it joins those arrays and returns a Fortran-ordered NumPy array.
print(df[df.columns[0]].values.flags)    # single column (1-D)
# C_CONTIGUOUS : True
# F_CONTIGUOUS : True

print(df[df.columns[0:2]].values.flags)  # multiple columns
# C_CONTIGUOUS : False
# F_CONTIGUOUS : True

print(df.values.flags)                   # entire dataframe
# C_CONTIGUOUS : False
# F_CONTIGUOUS : True
So column operations (add/edit/delete, etc.) are very fast; that is also why iterating over the rows of a DataFrame is slow. If your program does more row operations, convert it to C order as below.
df = pd.DataFrame(np.ascontiguousarray(df.values), columns=df.columns)
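As a quick sanity check, here is a minimal sketch of that round trip on a tiny illustrative frame (the column names and values are just placeholders; the behavior assumes all columns share one numeric dtype, so pandas keeps them in a single block):

```python
import numpy as np
import pandas as pd

# Small illustrative frame built column by column, as pandas usually holds data.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0],
                   "b": [4.0, 5.0, 6.0],
                   "c": [7.0, 8.0, 9.0]})

print(df.values.flags["F_CONTIGUOUS"])   # True: pandas hands back F order
print(df.values.flags["C_CONTIGUOUS"])   # False

arr_c = np.ascontiguousarray(df.values)  # C-ordered copy of the same data
print(arr_c.flags["C_CONTIGUOUS"])       # True

# Rebuild the DataFrame on top of the C-ordered copy.
df_c = pd.DataFrame(arr_c, columns=df.columns, index=df.index)
print(df_c.equals(df))                   # True: values unchanged, layout flipped
```

Note that `np.ascontiguousarray` copies the data, so on a 100-million-row frame this conversion costs one full pass over memory; it pays off only if you then do many row-wise operations.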
Whenever I am done with column-wise processing, I convert the frame to a C-contiguous array, because scaling and batch training of a DNN are much faster on C-ordered arrays.
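To see why layout matters for row-wise work, here is a hedged micro-benchmark sketch in plain NumPy (the array size and repetition count are arbitrary choices, and absolute timings will vary by machine; the point is only that a row-wise reduction reads contiguous memory in C order but strided memory in F order):

```python
import numpy as np
from timeit import timeit

rng = np.random.default_rng(0)
a_c = rng.random((2000, 2000))   # C (row-major) order, NumPy's default
a_f = np.asfortranarray(a_c)     # same values, Fortran (column-major) order

# Row-wise sums read each row as one contiguous stretch only in C order.
t_c = timeit(lambda: a_c.sum(axis=1), number=50)
t_f = timeit(lambda: a_f.sum(axis=1), number=50)
print(f"C order: {t_c:.3f}s, F order: {t_f:.3f}s")

# The results are identical either way; only the access pattern differs.
assert np.allclose(a_c.sum(axis=1), a_f.sum(axis=1))
```

Running the same comparison with `axis=0` flips the situation: then the F-ordered array is the one reading contiguous memory.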