
The difference between double bracket `[[...]]` and single bracket `[...]` indexing in Pandas

I'm confused about the syntax regarding the following line of code:

x_values = dataframe[['Brains']]

The dataframe object consists of 2 columns (Brains and Bodies):

Brains Bodies
42     34
32     23

When I print x_values I get something like this:

Brains
0  42
1  32

I'm aware of the pandas documentation as far as attributes and methods of the dataframe object are concerned, but the double bracket syntax is confusing me.

Mike Fellner asked Jul 19 '17 21:07


4 Answers

Consider this:

Source DF:

In [79]: df
Out[79]:
   Brains  Bodies
0      42      34
1      32      23

Selecting one column with a string key returns a pandas Series:

In [80]: df['Brains']
Out[80]:
0    42
1    32
Name: Brains, dtype: int64

In [81]: type(df['Brains'])
Out[81]: pandas.core.series.Series

Selecting a subset of columns with a list key returns a DataFrame:

In [82]: df[['Brains']]
Out[82]:
   Brains
0      42
1      32

In [83]: type(df[['Brains']])
Out[83]: pandas.core.frame.DataFrame

Conclusion: the second approach (a list of column names) lets us select multiple columns from the DataFrame and always returns a DataFrame. The first (a single column name) selects exactly one column and returns it as a Series.

Demo:

In [84]: df = pd.DataFrame(np.random.rand(5,6), columns=list('abcdef'))

In [85]: df
Out[85]:
          a         b         c         d         e         f
0  0.065196  0.257422  0.273534  0.831993  0.487693  0.660252
1  0.641677  0.462979  0.207757  0.597599  0.117029  0.429324
2  0.345314  0.053551  0.634602  0.143417  0.946373  0.770590
3  0.860276  0.223166  0.001615  0.212880  0.907163  0.437295
4  0.670969  0.218909  0.382810  0.275696  0.012626  0.347549

In [86]: df[['e','a','c']]
Out[86]:
          e         a         c
0  0.487693  0.065196  0.273534
1  0.117029  0.641677  0.207757
2  0.946373  0.345314  0.634602
3  0.907163  0.860276  0.001615
4  0.012626  0.670969  0.382810

If we specify only one column in the list, we still get a DataFrame, now with a single column:

In [87]: df[['e']]
Out[87]:
          e
0  0.487693
1  0.117029
2  0.946373
3  0.907163
4  0.012626
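
To make the Series/DataFrame distinction concrete, here is a minimal sketch that rebuilds the two-column frame from the question (using the values shown there) and compares the type, ndim, and shape of the two selections. The variable names as_series and as_frame are only illustrative.

```python
import pandas as pd

# Rebuild the small frame from the question.
df = pd.DataFrame({"Brains": [42, 32], "Bodies": [34, 23]})

as_series = df["Brains"]     # string key -> one column as a Series
as_frame = df[["Brains"]]    # list key   -> DataFrame with the listed columns

# Print the class name, number of dimensions, and shape of each result.
print(type(as_series).__name__, as_series.ndim, as_series.shape)
print(type(as_frame).__name__, as_frame.ndim, as_frame.shape)
```

The Series has shape (2,), while the single-column DataFrame has shape (2, 1), which is why the two print differently even though they hold the same values.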

MaxU - stop WAR against UA answered Nov 16 '22 13:11


There is no special syntax in Python for [[ and ]]. Rather, a list is being created, and then that list is being passed as an argument to the DataFrame indexing function.

As per @MaxU's answer, if you index a DataFrame with a single string, a Series representing that one column is returned. If you index it with a list of strings, a DataFrame containing the given columns is returned.

So, when you do the following:

# Print "Brains" column as Series
print(df['Brains'])
# Print a DataFrame with only one column called "Brains"
print(df[['Brains']])

It is equivalent to the following:

# Print "Brains" column as Series
column_to_get = 'Brains'
print(df[column_to_get])
# Print a DataFrame with only one column called "Brains"
subset_of_columns_to_get = ['Brains']
print(df[subset_of_columns_to_get])

In both cases, the DataFrame is being indexed with the [] operator.

Python uses square brackets both for indexing and for constructing list literals, and ultimately I believe this is the source of your confusion. The outer [ and ] in df[['Brains']] perform the indexing, and the inner pair creates a list.

>>> some_list = ['Brains']
>>> some_list_of_lists = [['Brains']]
>>> ['Brains'] == [['Brains']][0]
True
>>> 'Brains' == [['Brains']][0][0] == [['Brains'][0]][0]
True

What I am illustrating above is that at no point does Python ever see [[ and interpret it specially. In the last convoluted example ([['Brains'][0]][0]) there is no special ][ operator or ]][ operator. What happens is:

  • A single-element list is created (['Brains'])
  • The first element of that list is indexed (['Brains'][0] => 'Brains')
  • That is placed into another list ([['Brains'][0]] => ['Brains'])
  • And then the first element of that list is indexed ([['Brains'][0]][0] => 'Brains')
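
One way to check this without pandas is a small probe object. The sketch below uses a hypothetical KeyProbe class (not part of pandas or any library) whose __getitem__ simply reports the key it was given, so you can see what Python actually passes for each bracket form.

```python
class KeyProbe:
    """Hypothetical stand-in for any object that supports obj[key]."""

    def __getitem__(self, key):
        # Python passes whatever expression sat between the outer brackets.
        return type(key).__name__, key


probe = KeyProbe()

single = probe['Brains']       # the key is the string 'Brains'
double = probe[['Brains']]     # the key is the list ['Brains']

columns = ['Brains']
via_name = probe[columns]      # same call as the line above, list built first

print(single)
print(double)
print(via_name)
```

The double-bracket form and the named-list form reach __getitem__ with the same list, which is all a DataFrame sees as well.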

SethMMorton answered Nov 16 '22 15:11


The difference between [ ] and [[ ]] mirrors a NumPy concept: the number of dimensions.

Try creating arrays with np.array, reshaping them with reshape, and checking the result with ndim, and the difference becomes clear. See my answer here:

https://stackoverflow.com/a/70194733/7660981
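
As a rough sketch of that suggestion (the array values are made up for illustration), the following compares a 1-D and a 2-D array. Note that this is only an analogy: in NumPy the nested brackets nest the data itself, whereas in df[['Brains']] the inner brackets build a list of column labels.

```python
import numpy as np

flat = np.array([1, 2, 3])        # one pair of brackets -> 1-D array
nested = np.array([[1, 2, 3]])    # nested brackets      -> 2-D array with one row

print(flat.ndim, flat.shape)
print(nested.ndim, nested.shape)

# reshape turns the 1-D array into the same 2-D layout.
reshaped = flat.reshape(1, 3)
print(reshaped.ndim, np.array_equal(reshaped, nested))
```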

mangal pavan answered Nov 16 '22 15:11


Other solutions demonstrate the difference between a Series and a DataFrame. For the mathematically minded, it may help to consider the dimensions of your input and output. Here's a summary:

Object                                Series          DataFrame
Dimensions (obj.ndim)                      1                  2
Syntax arg dim                             0                  1
Syntax                             df['col']        df[['col']]
Max indexing dim                           1                  2
Label indexing              df['col'].loc[x]   df.loc[x, 'col']
Label indexing (scalar)      df['col'].at[x]    df.at[x, 'col']
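Integer indexing           df['col'].iloc[x]      df.iloc[x, j]
Integer indexing (scalar)   df['col'].iat[x]       df.iat[x, j]

In the integer-indexing rows, x is an integer row position and j is the integer position of column 'col'; iloc and iat do not accept column labels.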

When you specify a scalar or list argument to pd.DataFrame.__getitem__, for which [] is syntactic sugar, the dimension of your argument is one less than the dimension of your result. So a scalar (0-dimensional) gives a 1-dimensional series. A list (1-dimensional) gives a 2-dimensional dataframe. This makes sense since the additional dimension is the dataframe index, i.e. rows. This is the case even if your dataframe happens to have no rows.
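
The dimension rule can be checked directly. This is a minimal sketch with a made-up frame (the column names col and other are assumptions for illustration); it calls __getitem__ explicitly and then repeats the check on a frame that has columns but no rows.

```python
import pandas as pd

df = pd.DataFrame({"col": [10, 20], "other": [1, 2]})

# [] is sugar for __getitem__, so df.__getitem__("col") is df["col"].
s = df.__getitem__("col")        # scalar key (0-D) -> Series (1-D)
f = df.__getitem__(["col"])      # list key (1-D)   -> DataFrame (2-D)
print(s.ndim, f.ndim)

# The rule still holds for a frame with columns but no rows.
empty = pd.DataFrame({"col": [], "other": []})
print(empty["col"].ndim, empty[["col"]].ndim, empty[["col"]].shape)
```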

jpp answered Nov 16 '22 14:11