
How do numpy functions operate on pandas objects internally?

NumPy functions such as np.mean() and np.var() accept an array-like argument, for example an np.ndarray or a list.

But passing in a pandas DataFrame also works. This means that a DataFrame can pass itself off as a NumPy array, which I find a little strange (even though I know that the underlying values of a df are NumPy arrays).

For an object to be array-like, I thought it should be sliceable with integer indexing the way a NumPy array is. So, for instance, df[1:3, 2:3] should work, but it raises an error.

So, possibly a dataframe gets converted into a numpy array when it goes inside the function. But if that is the case then why does np.mean(numpy_array) lead to a different result than that of np.mean(df)?

a = np.random.rand(4,2)
a
Out[13]: 
array([[ 0.86688862,  0.09682919],
       [ 0.49629578,  0.78263523],
       [ 0.83552411,  0.71907931],
       [ 0.95039642,  0.71795655]])

np.mean(a)
Out[14]: 0.68320065182041034

This gives a different result from the following:

df = pd.DataFrame(data=a, index=range(np.shape(a)[0]),
                  columns=range(np.shape(a)[1]))

df
Out[18]: 
          0         1
0  0.866889  0.096829
1  0.496296  0.782635
2  0.835524  0.719079
3  0.950396  0.717957

np.mean(df)
Out[21]: 
0    0.787276
1    0.579125
dtype: float64

The former output is a single number, whereas the latter is a column-wise mean. How does a NumPy function know about the structure of a DataFrame?

asked May 09 '17 09:05 by a-a



1 Answer

If you step through np.mean(df) in the debugger:

--Call--
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2796)mean()
-> def mean(a, axis=None, dtype=None, out=None, keepdims=False):
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2877)mean()
-> if type(a) is not mu.ndarray:
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2878)mean()
-> try:
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2879)mean()
-> mean = a.mean

You can see that the type is not an ndarray, so np.mean tries to call a.mean, which in this case is df.mean():

In [6]:

df.mean()
Out[6]:
0    0.572999
1    0.468268
dtype: float64

This is why the output is different: np.mean(df) returns whatever df.mean() returns, which here is a column-wise mean.
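The delegation seen in the trace can be reproduced without pandas. The sketch below is only an illustration: ColumnwiseTable is a hypothetical class (it is not part of NumPy or pandas), and its choice to treat axis=None as "per column" is an assumption made to mirror the DataFrame output in the question.

```python
import numpy as np


class ColumnwiseTable:
    """Hypothetical container, used only to illustrate the dispatch."""

    def __init__(self, rows):
        self._data = np.asarray(rows, dtype=float)

    def mean(self, axis=None, dtype=None, out=None):
        # np.mean forwards axis, dtype and out as keyword arguments.
        # Assumption for this sketch: axis=None means "per column".
        if axis is None:
            axis = 0
        return self._data.mean(axis=axis)


rows = [[1.0, 2.0], [3.0, 6.0]]
table = ColumnwiseTable(rows)

# Not an ndarray, so np.mean hands the call to table.mean(...)
delegated = np.mean(table)

# A real ndarray takes NumPy's own path: mean over all elements
plain = np.mean(np.array(rows))

print(delegated)
print(plain)
```

Any object that exposes a compatible mean method gets the same treatment, so np.mean never needs to know what a DataFrame is.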

Code to reproduce above:

In [3]:
a = np.random.rand(4,2)
a

Out[3]:
array([[ 0.96750329,  0.67623187],
       [ 0.44025179,  0.97312747],
       [ 0.07330062,  0.18341157],
       [ 0.81094166,  0.04030253]])

In [4]:    
np.mean(a)

Out[4]:
0.52063384885403818

In [5]:    
df = pd.DataFrame(data=a, index=range(np.shape(a)[0]),
                  columns=range(np.shape(a)[1]))
df

Out[5]:
          0         1
0  0.967503  0.676232
1  0.440252  0.973127
2  0.073301  0.183412
3  0.810942  0.040303

numpy output:

In [7]:
np.mean(df)

Out[7]:
0    0.572999
1    0.468268
dtype: float64

If you call .values to get the underlying NumPy array, the output is the same as np.mean(a):

In [8]:
np.mean(df.values)

Out[8]:
0.52063384885403818
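Since np.mean(df) simply forwards axis=None to df.mean, what it returns depends on how the installed pandas interprets that argument. A minimal sketch of asking for each result explicitly, using a small made-up array rather than the random data above and assuming a pandas version that provides DataFrame.to_numpy():

```python
import numpy as np
import pandas as pd

# Small illustrative data (not the random array from the session above)
a = np.array([[1.0, 2.0], [3.0, 6.0]])
df = pd.DataFrame(a)

# One mean per column, returned as a Series
col_means = df.mean(axis=0)

# One mean over every element, computed on the underlying ndarray
overall = df.to_numpy().mean()

print(col_means)
print(overall)
```

Naming the axis, or converting to an array first, makes the intent clear to a reader regardless of which library ends up doing the reduction.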
answered Sep 21 '22 05:09 by EdChum