
Pandas vs Numpy indexing: Why this fundamental difference in ordering of indices?

Numpy:

import numpy as np
nparr = np.array([[1, 5],[2,6], [3, 7]])
print(nparr)
print(nparr[0])    #first choose the row 
print(nparr[0][1]) #second choose the column

gives the output as expected:

[[1 5]
 [2 6]
 [3 7]]

[1 5]

5

Pandas:

import pandas as pd
df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [5, 6, 7]
})
print(df)
print(df['a'])  #first choose the column !!!
print(df['a'][1])  #second choose the row !!!

gives the following output:

   a  b
0  1  5
1  2  6
2  3  7

0    1
1    2
2    3
Name: a, dtype: int64

2

What is the fundamental reason for changing the default ordering of "indexes" in Pandas dataframe to be column first? What is the benefit we get for this loss of consistency/intuitiveness?

Of course, if I use the iloc indexer, I can write it much like NumPy array indexing:

print(df)
print(df.iloc[0])     # first choose the row
print(df.iloc[0].iloc[1])  # second choose the column (plain [1] on a labelled Series is deprecated)
   a  b
0  1  5
1  2  6
2  3  7

a    1
b    5
Name: 0, dtype: int64

5
2020 asked Mar 04 '23 00:03



2 Answers

Because NumPy's model is mathematics (more specifically matrices, akin to MATLAB), while pandas's model is the database table (akin to SQL). NumPy goes by rows and columns, rows first, because element (i, j) of a matrix denotes the ith row and jth column. pandas treats a DataFrame as a set of named columns, like a database table: df['a'] selects a column, and inside that column you choose elements, i.e. rows. Of course, you can still index by position using iloc, as you mentioned.

Hope the difference in paradigms/philosophies of the two makes sense.
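As a rough illustration of the two conventions described above, the sketch below reuses the question's data (the variable names are only for illustration). Plain `[]` on a DataFrame picks a column by label, while `.loc` and `.iloc` take the row first and the column second, as NumPy does.

```python
import numpy as np
import pandas as pd

nparr = np.array([[1, 5], [2, 6], [3, 7]])
df = pd.DataFrame(nparr, columns=["a", "b"])

# NumPy: position (i, j) means row i, column j.
np_value = nparr[1, 0]

# pandas: plain [] selects a column by label, much like SELECT a FROM table.
col_a = df["a"]

# .loc (labels) and .iloc (positions) both take the row first, then the column.
by_label = df.loc[1, "a"]
by_position = df.iloc[1, 0]

print(np_value, by_label, by_position)  # 2 2 2
```

So the "column first" behaviour applies to `df[...]` only; the two-axis indexers keep the matrix-style ordering.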

FatihAkici answered Mar 05 '23 16:03



numpy indexing is multidimensional. pandas is table-oriented and essentially 2d (a MultiIndex adds hierarchical levels, but the frame is still a table).

In [42]: nparr = np.array([[1, 5],[2,6], [3, 7]])                               
In [43]: nparr                                                                  
Out[43]: 
array([[1, 5],
       [2, 6],
       [3, 7]])
In [44]: nparr[0]             # select a row                                                               
Out[44]: array([1, 5])
In [45]: nparr[:,0]           # select a column                                    
Out[45]: array([1, 2, 3])
In [46]: nparr[:,[0]]         # also a column, but keep 2d                                                  
Out[46]: 
array([[1],
       [2],
       [3]])
In [47]: nparr[:2,[1,0]]      # more general - 2 rows, 2 columns (reordered)                                                  
Out[47]: 
array([[5, 1],
       [6, 2]])

Your nparr[0][1] is more idiomatically written as nparr[0,1].
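A small sketch of why the single-bracket form is preferred: for two integers both spellings give the same element, but once a slice is involved the chained form indexes the intermediate result's rows again rather than its columns.

```python
import numpy as np

nparr = np.array([[1, 5], [2, 6], [3, 7]])

# With two integers, both forms read the same element.
assert nparr[0][1] == nparr[0, 1] == 5

# With a slice the two forms diverge: chaining applies the second index
# to the rows of the intermediate array, not to its columns.
chained = nparr[:2][1]   # second row of the first two rows
single = nparr[:2, 1]    # column 1 of the first two rows

print(chained, single)  # [2 6] [5 6]
```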

This indexing generalizes to 3d (and higher):

In [48]: arr = np.arange(24).reshape(2,3,4)                                     
In [49]: arr                                                                    
Out[49]: 
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
In [50]: arr[1,1,:]                                                             
Out[50]: array([16, 17, 18, 19])

It also generalizes to 1d (which will be like indexing a list), and even 0d.
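For instance (a minimal sketch; the names `vec` and `scalar` are just for illustration), a 1d array takes a single index, and a 0d array is indexed with an empty tuple:

```python
import numpy as np

vec = np.arange(5)       # 1d: one index, like a list
scalar = np.array(42)    # 0d: shape is ()

print(vec[2], vec[1:4])          # 2 [1 2 3]
print(scalar.shape, scalar[()])  # () 42
```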

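If I make a dataframe from this array, the data or values of the frame are the array itself (at least in the pandas version used here; with Copy-on-Write enabled, the constructor copies the array unless copy=False is passed):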

In [52]: df = pd.DataFrame(nparr)                                               
In [53]: df                                                                     
Out[53]: 
   0  1
0  1  5
1  2  6
2  3  7
In [54]: df._values                                                             
Out[54]: 
array([[1, 5],
       [2, 6],
       [3, 7]])

If I modify an element of the array, the change shows up in the frame as well:

In [56]: nparr[0,1] *=100                                                       
In [57]: nparr                                                                  
Out[57]: 
array([[  1, 500],
       [  2,   6],
       [  3,   7]])
In [58]: df                                                                     
Out[58]: 
   0    1
0  1  500
1  2    6
2  3    7

In [61]: df[1]          # a Series                                                        
Out[61]: 
0    500
1      6
2      7
Name: 1, dtype: int64

pandas has added its own layer of indexing (including column and row labels) on top of the underlying array. In one way or another, it maps its indexing inputs onto the array's.
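Conceptually, that mapping can be sketched as a label-to-position lookup followed by ordinary array indexing. This is only an illustration using public API (`Index.get_loc`, `DataFrame.to_numpy`), not a description of pandas internals; the row labels "x", "y", "z" are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [5, 6, 7]}, index=["x", "y", "z"])

# Translate labels into integer positions on each axis.
row_pos = df.index.get_loc("y")     # 1
col_pos = df.columns.get_loc("b")   # 1

# Index the plain 2d array with those positions.
values = df.to_numpy()
assert df.loc["y", "b"] == values[row_pos, col_pos] == 6
```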

Since there are other ways of constructing a dataframe, there isn't always a one-to-one match between a frame and an array.
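One case where the match breaks down, shown here as a small sketch: a frame built from a dict with columns of different types keeps each column's own dtype, so no single homogeneous array sits behind it, and converting to an array has to fall back to a common dtype.

```python
import pandas as pd

# Columns of different types: integers and strings.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

print(df.dtypes["a"])     # int64

# Converting the whole frame builds a new array with a common dtype.
combined = df.to_numpy()
print(combined.dtype)     # object
```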

hpaulj answered Mar 05 '23 15:03
