I'm working my way through Pandas for Data Analysis and learning a ton. However, one thing keeps coming up. The book typically refers to columns of a dataframe as df['column']; sometimes, though, without explanation, the book uses df.column.
I don't understand the difference between the two. Any help would be appreciated.
Below is some code demonstrating what I am talking about:
In [5]:
import pandas as pd
data = {'column1': ['a', 'a', 'a', 'b', 'c'],
'column2': [1, 4, 2, 5, 3]}
df = pd.DataFrame(data, columns = ['column1', 'column2'])
df
Out[5]:
column1 column2
0 a 1
1 a 4
2 a 2
3 b 5
4 c 3
5 rows × 2 columns
df.column:
In [8]:
df.column1
Out[8]:
0 a
1 a
2 a
3 b
4 c
Name: column1, dtype: object
df['column']:
In [9]:
df['column1']
Out[9]:
0 a
1 a
2 a
3 b
4 c
Name: column1, dtype: object
When you write df['column'] you get back a single column as a Series, but when you write df[['column']] you get a DataFrame object, which is what code expecting a DataFrame needs.
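To make the single-bracket versus double-bracket distinction concrete, here is a minimal sketch that reuses the column names from the question. The variable names `single` and `double` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'column1': ['a', 'a', 'a', 'b', 'c'],
                   'column2': [1, 4, 2, 5, 3]})

single = df['column1']    # one label: returns a Series
double = df[['column1']]  # list of labels: returns a one-column DataFrame

print(type(single).__name__)
print(type(double).__name__)
print(double.shape)
```

The Series is one-dimensional, while the DataFrame keeps its two-dimensional shape even with only one column.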
A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. It can be thought of as a dict-like container for Series objects. It is the primary pandas data structure.
We create a dictionary holding the values for each column and convert it into a pandas DataFrame. Then, using NumPy's where() function, we supply the condition used to compare the columns.
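The code this paragraph describes is not shown, so the following is only a sketch of what such a comparison might look like. The column `column3`, its values, and the labels 'greater' and 'not greater' are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# column3 and its values are hypothetical; column2 comes from the question.
data = {'column2': [1, 4, 2, 5, 3],
        'column3': [3, 4, 1, 6, 3]}
df = pd.DataFrame(data)

# np.where(condition, value_if_true, value_if_false) is evaluated element-wise.
df['comparison'] = np.where(df['column2'] > df['column3'],
                            'greater', 'not greater')
print(df)
```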
The diff() method returns a DataFrame with the difference between the values of each row and, by default, the previous row. Which row to compare with can be specified with the periods parameter. If the axis parameter is set to axis='columns', the method finds the difference column by column instead of row by row.
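A small sketch of diff() along both axes may help. The frame and its column names `x` and `y` are invented for this example.

```python
import pandas as pd

# Hypothetical numeric frame, used only to illustrate diff().
df = pd.DataFrame({'x': [1, 4, 9, 16],
                   'y': [10, 20, 40, 80]})

row_diff = df.diff()                # each row minus the previous row
row_diff2 = df.diff(periods=2)      # each row minus the row two positions back
col_diff = df.diff(axis='columns')  # each column minus the column to its left

print(row_diff)
print(row_diff2)
print(col_diff)
```

Positions with nothing to subtract from (the first row, or the first column when axis='columns') come back as NaN.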
For setting values, you need to use df['column'] = series. Once this is done, however, you can refer to that column in the future with df.column, assuming it's a valid Python name. (So df.column works, but a column named 6column would still have to be accessed with df['6column'], since df.6column is not valid Python.)
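Here is a short sketch of where attribute access stops working. The column names '6column' and 'count' are hypothetical. The 'count' case goes slightly beyond what the answer says: a column whose name matches an existing DataFrame method is also shadowed by that method under attribute access.

```python
import pandas as pd

# '6column' and 'count' are made-up names chosen to show the edge cases.
df = pd.DataFrame({'column1': ['a', 'b'],
                   '6column': [1, 2],
                   'count': [10, 20]})

# A valid identifier that does not clash with anything: both forms agree.
same = df.column1.equals(df['column1'])

# Not a valid Python identifier, so only bracket access can be written;
# df.6column would be a SyntaxError.
six = df['6column']

# 'count' is also a DataFrame method, so df.count is the method,
# while df['count'] is the column.
attr_is_method = callable(df.count)
count_col = df['count']

print(same, attr_is_method)
```

Bracket access works for every column name, which is one reason to prefer it in code that has to be robust.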
I think the subtle difference here is that when you set something with df['column'] = ser, pandas goes ahead and adds it to the columns and does some other bookkeeping (I believe by overriding the functionality in __setitem__). If you do df.column = ser for a name that is not already a column, it's just like adding a new field to any existing object via __setattr__, and pandas does not seem to turn it into a column.
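The difference can be checked directly. This is a sketch that assumes a reasonably recent pandas. The names `column3` and `column4` are illustrative, and the warning filter is there only because pandas emits a UserWarning when a list-like value is assigned to a new attribute name.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'column1': ['a', 'a', 'a', 'b', 'c'],
                   'column2': [1, 4, 2, 5, 3]})
ser = pd.Series([10, 20, 30, 40, 50])

# Bracket assignment goes through __setitem__ and creates a real column.
df['column3'] = ser

# Attribute assignment with a new name only sets a plain Python attribute.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silence the UserWarning pandas raises here
    df.column4 = ser

print(list(df.columns))
print('column4' in df.columns)
print(hasattr(df, 'column4'))
```

So df.column4 exists as an attribute on the object, but it is not part of the table and will not appear in the printed frame or survive operations that return a new DataFrame.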