
Convert pandas dataframe to numpy array - which approach to prefer? [duplicate]

I need to convert a large dataframe to a numpy array, preserving only the numerical values and their types. I know there are several well-documented ways to do so.

So, which one should I prefer?

df.values
df.as_matrix()
pd.to_numeric(df)
... others ...

Decision factor:

  • efficiency

  • safe handling of NaN (np.nan) and other unexpected values

  • numerical stability

00__00__00 asked Mar 08 '18 18:03


2 Answers

Under the hood, a pandas.DataFrame stores its data in numpy.ndarray blocks. The simplest and possibly fastest way to get at them is pandas.DataFrame.values:

DataFrame.values

Numpy representation of NDFrame

Notes

The dtype will be a lower-common-denominator dtype (implicit upcasting); that is to say if the dtypes (even of numeric types) are mixed, the one that accommodates all will be chosen. Use this with care if you are not dealing with the blocks.

e.g. If the dtypes are float16 and float32, dtype will be upcast to float32. If dtypes are int32 and uint8, dtype will be upcast to int32. By numpy.find_common_type convention, mixing int64 and uint64 will result in a float64 dtype.
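A minimal sketch of the upcasting described in the note above. The column names and values are made up for illustration; only the resulting dtypes matter.

```python
import numpy as np
import pandas as pd

# int32 + uint8 columns: int32 can hold every uint8 value, so it wins.
ints = pd.DataFrame({
    "a": np.array([1, 2, 3], dtype=np.int32),
    "b": np.array([4, 5, 6], dtype=np.uint8),
})
ints_dtype = ints.values.dtype

# int64 + uint64 columns: no integer type covers both ranges,
# so the result falls back to float64 (large values may lose precision).
wide = pd.DataFrame({
    "a": np.array([1, 2, 3], dtype=np.int64),
    "b": np.array([4, 5, 6], dtype=np.uint64),
})
wide_dtype = wide.values.dtype

# A single non-numeric column forces the whole array to object dtype.
mixed = pd.DataFrame({"a": [1, 2, 3], "label": ["x", "y", "z"]})
mixed_dtype = mixed.values.dtype

print(ints_dtype, wide_dtype, mixed_dtype)
```

The last case is the one to watch when only numerical values are wanted: dropping or converting non-numeric columns before calling .values avoids ending up with an object array.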

ascripter answered Sep 28 '22 06:09


The functions you mention serve different purposes.

  1. pd.to_numeric: Use this to convert types in your dataframe if your data is not currently stored in numeric form, or if you wish to downcast to a smaller type via downcast='float' or downcast='integer'. It operates on one column (Series) at a time, not on a whole dataframe.

  2. pd.DataFrame.to_numpy() (v0.24+) or pd.DataFrame.values: Use this to retrieve the numpy array representation of your dataframe.

  3. pd.DataFrame.as_matrix: Do not use this. It was kept only for backwards compatibility (deprecated in v0.23 and removed in pandas 1.0).
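One way the first two points could be combined, as a rough sketch rather than a definitive recipe. The frame below is invented for illustration, and using select_dtypes to keep only numeric columns is an assumption about what "preserving only numerical values" means here.

```python
import numpy as np
import pandas as pd

# Hypothetical input: numbers stored as strings (with one bad value),
# a float column containing NaN, and a purely textual column.
df = pd.DataFrame({
    "x": pd.Series(["1", "2", "oops"], dtype=object),
    "y": [1.5, np.nan, 3.0],
    "label": ["a", "b", "c"],
})

# Step 1: convert the string column; unparseable entries become NaN
# instead of raising.
df["x"] = pd.to_numeric(df["x"], errors="coerce")

# Step 2: keep only numeric columns, then take the array.
numeric = df.select_dtypes(include="number")
arr = numeric.to_numpy()

# Optional: downcast to the smallest integer type that fits the data.
small = pd.to_numeric(pd.Series([1, 2, 3]), downcast="integer")

print(arr.dtype, arr.shape, small.dtype)
```

With errors="coerce", bad values surface as NaN in the array, so they can be counted with np.isnan(arr).sum() or handled explicitly before any numerical work.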

jpp answered Sep 28 '22 04:09