Numpy:
import numpy as np
nparr = np.array([[1, 5],[2,6], [3, 7]])
print(nparr)
print(nparr[0]) #first choose the row
print(nparr[0][1]) #second choose the column
gives the output as expected:
[[1 5]
 [2 6]
 [3 7]]
[1 5]
5
Pandas:
import pandas as pd
df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [5, 6, 7]
})
print(df)
print(df['a']) #first choose the column !!!
print(df['a'][1]) #second choose the row !!!
gives the following output:
   a  b
0  1  5
1  2  6
2  3  7
0    1
1    2
2    3
Name: a, dtype: int64
2
What is the fundamental reason for changing the default ordering of "indexes" in Pandas dataframe to be column first? What is the benefit we get for this loss of consistency/intuitiveness?
Of course, if I use the iloc indexer, I can write it much like NumPy array indexing:
print(df)
print(df.iloc[0]) # first choose the row
print(df.iloc[0].iloc[1]) # second choose the column (by position)
   a  b
0  1  5
1  2  6
2  3  7
a    1
b    5
Name: 0, dtype: int64
5
Because NumPy's intuition comes from mathematics (more specifically matrices, akin to MATLAB), while pandas's comes from databases (akin to SQL). NumPy goes by rows and columns (rows first, because element (i, j) of a matrix denotes the i-th row and j-th column), while pandas works with the columns of a table, inside which you choose elements, i.e. rows. Of course, you can work directly with positions by using iloc, as you mentioned.
Hope the difference in paradigms/philosophies of the two makes sense.
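As a small sketch of the two paradigms side by side (reusing the frame from the question; the variable names are only illustrative), the column-first lookup and the row-first indexers reach the same cell:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [5, 6, 7]})

# Database-style: pick a column by name, then a row by its index label.
col_then_row = df["a"][1]

# Matrix-style: row first, column second, in a single indexer call.
by_label = df.loc[1, "a"]      # label-based
by_position = df.iloc[1, 0]    # position-based

print(col_then_row, by_label, by_position)
```

All three expressions address the same element; loc and iloc simply restore the (row, column) order that NumPy users expect.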
numpy indexing is multidimensional. pandas is table oriented, just 2d (except for a multi-index variation).
In [42]: nparr = np.array([[1, 5],[2,6], [3, 7]])
In [43]: nparr
Out[43]:
array([[1, 5],
       [2, 6],
       [3, 7]])
In [44]: nparr[0] # select a row
Out[44]: array([1, 5])
In [45]: nparr[:,0] # select a column
Out[45]: array([1, 2, 3])
In [46]: nparr[:,[0]] # also a column, but keep 2d
Out[46]:
array([[1],
       [2],
       [3]])
In [47]: nparr[:2,[1,0]] # more general - 2 rows, 2 columns (reordered)
Out[47]:
array([[5, 1],
       [6, 2]])
Your nparr[0][1] is more idiomatically written as nparr[0,1].
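A minimal sketch of why the single-tuple form is preferred: chaining indexes the result of the previous step, so it happens to work for picking one element but cannot be used to select a column.

```python
import numpy as np

nparr = np.array([[1, 5], [2, 6], [3, 7]])

chained = nparr[0][1]   # two steps: take row 0, then index into that 1d row
direct = nparr[0, 1]    # one step: a single (row, column) index

# Chaining cannot select a column: nparr[:] is the whole array again,
# so the second [0] still picks row 0.
row_again = nparr[:][0]
column = nparr[:, 0]

print(chained, direct)
print(row_again, column)
```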
This indexing generalizes to 3d (and higher):
In [48]: arr = np.arange(24).reshape(2,3,4)
In [49]: arr
Out[49]:
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
In [50]: arr[1,1,:]
Out[50]: array([16, 17, 18, 19])
It also generalizes to 1d (which will be like indexing a list), and even 0d.
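For illustration, here is a short sketch of the 1d and 0d cases (the array names are just placeholders): a 1d array is indexed like a list, and a 0d array takes an empty tuple as its index.

```python
import numpy as np

one_d = np.arange(5)
element = one_d[2]          # same syntax as indexing a list
piece = one_d[1:3]          # slices work the same way too

zero_d = np.array(42)       # a 0d array has an empty shape
scalar = zero_d[()]         # index with an empty tuple: one entry per dimension

print(element, piece)
print(zero_d.shape, scalar)
```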
If I make a dataframe from this array, the data or values of the frame are the array itself:
In [52]: df = pd.DataFrame(nparr)
In [53]: df
Out[53]:
   0  1
0  1  5
1  2  6
2  3  7
In [54]: df._values
Out[54]:
array([[1, 5],
       [2, 6],
       [3, 7]])
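If I modify an element of the array, the change shows up in the frame as well (here the frame shares memory with the array; with pandas copy-on-write enabled, the constructor copies the array unless copy=False is passed, and the frame would not change):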
In [56]: nparr[0,1] *=100
In [57]: nparr
Out[57]:
array([[  1, 500],
       [  2,   6],
       [  3,   7]])
In [58]: df
Out[58]:
   0    1
0  1  500
1  2    6
2  3    7
In [61]: df[1] # a Series
Out[61]:
0    500
1      6
2      7
Name: 1, dtype: int64
pandas has added its own layer of indexing (including column and row labels) on top of the underlying array. One way or another, it maps its indexing inputs onto the array's.
Since there are other ways of constructing a dataframe, there isn't always a one-to-one match between a frame and an array.
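One case where the match breaks down, sketched here with a hypothetical frame named mixed: when the columns have different dtypes, no single homogeneous array can back the frame, so converting to NumPy has to build a new array with a common dtype.

```python
import pandas as pd

# Hypothetical frame with one integer column and one string column.
mixed = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

print(mixed.dtypes)            # each column keeps its own dtype

as_array = mixed.to_numpy()    # must fall back to a common dtype
print(as_array.dtype)          # object, since ints and strings are mixed
print(as_array.shape)
```

Each column is still easy to reach with mixed['a'] or mixed['b'], which is part of why column-first access is the natural default for a frame.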