How do numpy functions operate on pandas objects internally?

NumPy functions, e.g. np.mean(), np.var(), etc., accept an array-like argument, such as an np.ndarray or a list.

But passing in a pandas DataFrame also works. This means that a DataFrame can pass itself off as a numpy array, which I find a little strange (even though I know that the underlying values of a df are numpy arrays).

For an object to be array-like, I thought it should be sliceable with integer indexing in the same way a numpy array is. So, for instance, df[1:3, 2:3] should work, but it raises an error.
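For what it's worth, "array-like" in NumPy seems to be about whether np.asarray can convert the object, not about whether the object supports NumPy-style indexing. A minimal sketch, assuming a small made-up frame (the data and variable names are illustrative, not the values from this post); positional slicing on a DataFrame goes through .iloc:

```python
import numpy as np
import pandas as pd

# Illustrative data, not the values from the question.
a = np.arange(8.0).reshape(4, 2)
df = pd.DataFrame(a)

# A nested list is array-like too, yet lst[1:3, 1:2] would fail on it as well.
as_array_from_list = np.asarray([[1, 2], [3, 4]])

# np.asarray converts the DataFrame into a plain ndarray.
as_array_from_df = np.asarray(df)

# Positional 2-D slicing on the DataFrame itself uses .iloc.
block = df.iloc[1:3, 1:2]
print(type(as_array_from_df), block.shape)
```

So a list and a DataFrame are in the same position here: neither accepts a[1:3, 1:2] directly, but both can be converted to an ndarray.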

So possibly a DataFrame gets converted into a numpy array once it is inside the function. But if that is the case, why does np.mean(numpy_array) give a different result from np.mean(df)?

a = np.random.rand(4,2)
array([[ 0.86688862,  0.09682919],
       [ 0.49629578,  0.78263523],
       [ 0.83552411,  0.71907931],
       [ 0.95039642,  0.71795655]])

np.mean(a)
Out[14]: 0.68320065182041034

This gives a different result from the following:

df = pd.DataFrame(data=a, index=range(np.shape(a)[0]), 

          0         1
0  0.866889  0.096829
1  0.496296  0.782635
2  0.835524  0.719079
3  0.950396  0.717957

np.mean(df)
0    0.787276
1    0.579125
dtype: float64

The former output is a single number, whereas the latter is a column-wise mean. How does a numpy function know about the structure of a DataFrame?

If you step through np.mean(df) in the debugger:

> d:\winpython-64bit-\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2796)mean()
-> def mean(a, axis=None, dtype=None, out=None, keepdims=False):
(Pdb) s
> d:\winpython-64bit-\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2877)mean()
-> if type(a) is not mu.ndarray:
(Pdb) s
> d:\winpython-64bit-\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2878)mean()
-> try:
(Pdb) s
> d:\winpython-64bit-\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2879)mean()
-> mean = a.mean

You can see that the type is not an ndarray, so it tries to call a.mean, which in this case is df.mean():

In [6]:
df.mean()

0    0.572999
1    0.468268
dtype: float64

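This is why the output is different: np.mean never reduces the underlying array itself, it hands the call over to the DataFrame's own mean method, which works column by column.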
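The delegation can be seen without pandas at all. A minimal sketch, assuming a hypothetical class HasMean (my own name, not part of numpy or pandas) that is not an ndarray but does have a mean method:

```python
import numpy as np


class HasMean:
    """Hypothetical stand-in for a DataFrame: not an ndarray, but has .mean()."""

    def mean(self, axis=None, dtype=None, out=None, **kwargs):
        # np.mean forwards its arguments here instead of computing anything.
        return "HasMean.mean was called with axis={!r}".format(axis)


result = np.mean(HasMean())
print(result)
```

np.mean returns whatever the object's own mean method returns, so the object, not numpy, decides what "mean" means for it.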

Code to reproduce the above:

In [3]:
a = np.random.rand(4,2)

array([[ 0.96750329,  0.67623187],
       [ 0.44025179,  0.97312747],
       [ 0.07330062,  0.18341157],
       [ 0.81094166,  0.04030253]])

In [4]:
np.mean(a)


In [5]:    
df = pd.DataFrame(data=a, index=range(np.shape(a)[0]), 

          0         1
0  0.967503  0.676232
1  0.440252  0.973127
2  0.073301  0.183412
3  0.810942  0.040303

numpy output:

In [7]:
np.mean(df)

0    0.572999
1    0.468268
dtype: float64

If you call .values to get the underlying np array first, the result is the same single number as np.mean(a):

In [8]:
np.mean(df.values)
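A minimal sketch for checking both routes, assuming freshly generated random data (so the numbers will not match the session above) and using df.to_numpy(), the newer spelling of df.values. The axis is spelled out on the DataFrame side because what np.mean(df) returns depends on how the installed pandas treats axis=None; I believe newer releases reduce over both axes and return a scalar there.

```python
import numpy as np
import pandas as pd

# Fresh random data; the seed is arbitrary and only makes reruns repeatable.
rng = np.random.default_rng(0)
a = rng.random((4, 2))
df = pd.DataFrame(a)

# Reducing the underlying ndarray gives one number, same as np.mean(a).
overall = np.mean(df.to_numpy())

# Asking the DataFrame explicitly for axis=0 gives one mean per column.
per_column = df.mean(axis=0)

print(overall)
print(per_column)
```

Being explicit about the axis, or converting to an ndarray first, avoids relying on how a particular pandas version interprets the arguments numpy passes along.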

