I ran the following Python code, which creates a Pandas DataFrame with two Series (a and b), and then attempts to create two new Series (c and d):
import pandas as pd
df = pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6]})
df['c'] = df.a + df.b
df.d = df.a + df.b
My understanding is that if a Pandas Series is part of a DataFrame, and the Series name does not have any spaces (and does not collide with an existing attribute or method), the Series can be accessed as an attribute of the DataFrame. As such, I expected that line 3 would work (since that's how you create a new Pandas Series), and I expected that line 4 would fail (since the d attribute does not exist for the DataFrame until after you execute that line of code).
To my surprise, line 4 did not result in an error. Instead, the DataFrame now contains three Series:
>>> df
   a  b  c
0  1  4  5
1  2  5  7
2  3  6  9
And there is a new object, df.d, which is a Pandas Series:
>>> df.d
0    5
1    7
2    9
dtype: int64
>>> type(df.d)
pandas.core.series.Series
My questions are as follows:
Is df.d now a "normal" Pandas Series with all of the regular Series functionality?
Is df.d in any way "connected" to the df DataFrame, or is it a completely independent object?
My motivation in asking this question is simply that I want to better understand Pandas, and not because there is a particular use case for line 4.
My Python version is 2.7.11, and my Pandas version is 0.17.1.
When doing assignment, you need to use bracket notation, e.g. df['d'] = ...
d is now an attribute of the dataframe object df, not a column. As with most Python objects, you can assign new attributes to it, which is why no error was raised. It just didn't behave as you expected...
df.some_property = 'What?'
>>> df.some_property
'What?'
This is a common area of misunderstanding for beginners to Pandas. Always use bracket notation for assignment. The dot notation is for convenience when referencing the dataframe/series. To be safe, you could always use bracket notation.
And yes, df.d per your example is a normal Series that is now an unexpected attribute of the dataframe. The Series is its own object; its only connection to df is the reference you created when you assigned it as an attribute.
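The difference between the two assignments can be checked directly. The following is a minimal sketch and not part of the original answer: the variable names are chosen for illustration, and the warnings filter is included only on the assumption that newer pandas versions may emit a UserWarning for this kind of attribute assignment.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Bracket notation creates a real column.
df['c'] = df.a + df.b

# Dot notation with a new name only sets a plain Python attribute.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # newer pandas may warn here
    df.d = df.a + df.b

columns = list(df.columns)       # 'd' is not among the columns
is_column = 'd' in df.columns    # False
is_attribute = hasattr(df, 'd')  # True

# Dot notation does update a column that already exists.
df.a = [10, 20, 30]
updated_a = df['a'].tolist()
```

So dot assignment is only ambiguous for names that are not yet columns, which is one reason bracket notation is the safer habit.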
@Alexander's answer is good. But just to clarify, this behavior is not specific to pandas; it comes from Python itself. See here for a related question:
Why is adding attributes to an already instantiated object allowed in Python?
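To illustrate the Python behavior in isolation, here is a small sketch that does not involve pandas at all. The class names Plain and Restricted are hypothetical and exist only for this example; the second class uses __slots__ to show that a class has to opt out explicitly if it wants to forbid new attributes.

```python
class Plain:
    """An ordinary class with no attribute restrictions."""
    pass


class Restricted:
    """A class that only allows the attributes named in __slots__."""
    __slots__ = ('x',)


p = Plain()
p.anything = 'What?'  # allowed: stored in the instance's __dict__

r = Restricted()
r.x = 1               # allowed: 'x' is declared in __slots__
try:
    r.y = 2           # not declared, so this raises AttributeError
    slot_error = False
except AttributeError:
    slot_error = True
```

A DataFrame behaves like Plain in this respect, so assigning to an unknown name succeeds instead of raising an error.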
As for your last question, the Series is not connected (depends on what you mean by connected though). But, imagine this:
df = pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6]})
df.d = df.a + df.b
df.sort_values("a", ascending=False, inplace=True)
df
   a  b
2  3  6
1  2  5
0  1  4
df.d
0    5
1    7
2    9
dtype: int64
So df.d has not been sorted, whereas df.a and df.b have.
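For contrast, here is a sketch of the same experiment with d created as a real column through bracket notation. It assumes a pandas version that provides sort_values, and the name detached is hypothetical, standing in for a Series that is kept outside the DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# A Series held outside the DataFrame, like the attribute in the example.
detached = df.a + df.b

# The same values stored as a real column.
df['d'] = df.a + df.b

df.sort_values('a', ascending=False, inplace=True)

column_values = df['d'].tolist()     # reordered together with 'a' and 'b'
detached_values = detached.tolist()  # still in the original order
```

The column moves with its rows, while the separately held Series keeps its original order.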