I'm confused about the syntax regarding the following line of code:
x_values = dataframe[['Brains']]
The dataframe object consists of 2 columns (Brains and Bodies)
Brains Bodies
42 34
32 23
When I print x_values I get something like this:
Brains
0 42
1 32
I'm aware of the pandas documentation as far as attributes and methods of the dataframe object are concerned, but the double bracket syntax is confusing me.
Consider this:
Source DF:
In [79]: df
Out[79]:
Brains Bodies
0 42 34
1 32 23
Selecting one column - results in Pandas.Series:
In [80]: df['Brains']
Out[80]:
0 42
1 32
Name: Brains, dtype: int64
In [81]: type(df['Brains'])
Out[81]: pandas.core.series.Series
Selecting subset of DataFrame - results in DataFrame:
In [82]: df[['Brains']]
Out[82]:
Brains
0 42
1 32
In [83]: type(df[['Brains']])
Out[83]: pandas.core.frame.DataFrame
Conclusion: the second approach allows us to select multiple columns from the DataFrame. The first one just for selecting single column...
Demo:
In [84]: df = pd.DataFrame(np.random.rand(5,6), columns=list('abcdef'))
In [85]: df
Out[85]:
a b c d e f
0 0.065196 0.257422 0.273534 0.831993 0.487693 0.660252
1 0.641677 0.462979 0.207757 0.597599 0.117029 0.429324
2 0.345314 0.053551 0.634602 0.143417 0.946373 0.770590
3 0.860276 0.223166 0.001615 0.212880 0.907163 0.437295
4 0.670969 0.218909 0.382810 0.275696 0.012626 0.347549
In [86]: df[['e','a','c']]
Out[86]:
e a c
0 0.487693 0.065196 0.273534
1 0.117029 0.641677 0.207757
2 0.946373 0.345314 0.634602
3 0.907163 0.860276 0.001615
4 0.012626 0.670969 0.382810
and if we specify only one column in the list we will get a DataFrame with one column:
In [87]: df[['e']]
Out[87]:
e
0 0.487693
1 0.117029
2 0.946373
3 0.907163
4 0.012626
There is no special syntax in Python for [[
and ]]
. Rather, a list is being created, and then that list is being passed as an argument to the DataFrame indexing function.
As per @MaxU's answer, if you pass a single string to a DataFrame a series that represents that one column is returned. If you pass a list of strings, then a DataFrame that contains the given columns is returned.
So, when you do the following
# Print "Brains" column as Series
print(df['Brains'])
# Return a DataFrame with only one column called "Brains"
print(df[['Brains']])
It is equivalent to the following
# Print "Brains" column as Series
column_to_get = 'Brains'
print(df[column_to_get])
# Return a DataFrame with only one column called "Brains"
subset_of_columns_to_get = ['Brains']
print(df[subset_of_columns_to_get])
In both cases, the DataFrame is being indexed with the []
operator.
Python uses the []
operator for both indexing and for constructing list literals, and ultimately I believe this is your confusion. The outer [
and ]
in df[['Brains']]
is performing the indexing, and the inner is creating a list.
>>> some_list = ['Brains']
>>> some_list_of_lists = [['Brains']]
>>> ['Brains'] == [['Brains']][0]
True
>>> 'Brains' == [['Brains']][0][0] == [['Brains'][0]][0]
True
What I am illustrating above is that at no point does Python ever see [[
and interpret it specially. In the last convoluted example ([['Brains'][0]][0]
) there is no special ][
operator or ]][
operator... what happens is
['Brains']
)['Brains'][0]
=> 'Brains'
)[['Brains'][0]]
=> ['Brains']
)[['Brains'][0]][0]
=> 'Brains'
)[ ]
and [[ ]]
are the concept of NumPy.
Try to understand the basics of np.array
creating and use reshape
and check with ndim
, you'll understand.
Check my answer here.
https://stackoverflow.com/a/70194733/7660981
Other solutions demonstrate the difference between a series and a dataframe. For the Mathematically minded, you may wish to consider the dimensions of your input and output. Here's a summary:
Object Series DataFrame
Dimensions (obj.ndim) 1 2
Syntax arg dim 0 1
Syntax df['col'] df[['col']]
Max indexing dim 1 2
Label indexing df['col'].loc[x] df.loc[x, 'col']
Label indexing (scalar) df['col'].at[x] df.at[x, 'col']
Integer indexing df['col'].iloc[x] df.iloc[x, 'col']
Integer indexing (scalar) df['col'].iat[x] dfi.at[x, 'col']
When you specify a scalar or list argument to pd.DataFrame.__getitem__
, for which []
is syntactic sugar, the dimension of your argument is one less than the dimension of your result. So a scalar (0-dimensional) gives a 1-dimensional series. A list (1-dimensional) gives a 2-dimensional dataframe. This makes sense since the additional dimension is the dataframe index, i.e. rows. This is the case even if your dataframe happens to have no rows.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With