
Creating a Pandas Series with a period in the name

Tags: python, pandas

I ran the following Python code, which creates a Pandas DataFrame with two Series (a and b), and then attempts to create two new Series (c and d):

import pandas as pd
df = pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6]})
df['c'] = df.a + df.b
df.d = df.a + df.b

My understanding is that if a Pandas Series is part of a DataFrame, and the Series name does not have any spaces (and does not collide with an existing attribute or method), the Series can be accessed as an attribute of the DataFrame. As such, I expected that line 3 would work (since that's how you create a new Pandas Series), and I expected that line 4 would fail (since the d attribute does not exist for the DataFrame until after you execute that line of code).

To my surprise, line 4 did not result in an error. Instead, the DataFrame now contains three Series:

>>> df
   a  b  c
0  1  4  5
1  2  5  7
2  3  6  9

And there is a new object, df.d, which is a Pandas Series:

>>> df.d
0    5
1    7
2    9
dtype: int64

>>> type(df.d)
pandas.core.series.Series

My questions are as follows:

  • Why did line 4 not result in an error?
  • Is df.d now a "normal" Pandas Series with all of the regular Series functionality?
  • Is df.d in any way "connected" to the df DataFrame, or is it a completely independent object?

My motivation in asking this question is simply that I want to better understand Pandas, and not because there is a particular use case for line 4.

My Python version is 2.7.11, and my Pandas version is 0.17.1.

Kevin Markham asked Mar 07 '16 17:03


2 Answers

To create a new column, you need to use bracket notation, e.g. df['d'] = ...

With df.d = ..., d becomes an ordinary Python attribute of the DataFrame object df rather than a column. Most Python objects let you assign arbitrary new attributes, which is why no error was raised. It just didn't behave as you expected...

>>> df.some_property = 'What?'
>>> df.some_property
'What?'

This is a common source of confusion for pandas beginners. Always use bracket notation when creating a new column. Dot notation is a convenience for reading an existing column; if in doubt, bracket notation always works.

And yes, df.d in your example is a normal Series, but it is stored as an unexpected attribute of the DataFrame rather than as one of its columns. The Series is its own object; its only connection to df is the attribute reference you created when you assigned it.
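As a rough sketch of the difference (not part of the original answer), the snippet below creates one column with bracket notation and one attribute with dot notation, then inspects both. The variable names `columns`, `d_is_series`, `d_total`, and `a_values` are made up for illustration. Newer pandas releases appear to emit a UserWarning on the dot-notation assignment, so the example silences it; that detail is an assumption about your installed version.

```python
import warnings

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Bracket notation creates a real column.
df['c'] = df['a'] + df['b']

# Dot notation with a new name only sets a Python attribute on the object.
# Some pandas versions warn about this, so suppress the warning for the demo.
with warnings.catch_warnings():
    warnings.simplefilter('ignore', UserWarning)
    df.d = df['a'] + df['b']

columns = list(df.columns)                  # contains 'c' but not 'd'
d_is_series = isinstance(df.d, pd.Series)   # df.d is still a regular Series
d_total = int(df.d.sum())                   # Series methods work as usual

# Dot notation on an *existing* column name does update the column.
df.a = [10, 20, 30]
a_values = df['a'].tolist()
```

So the attribute holds a fully functional Series; it just never becomes part of the DataFrame's columns.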

Alexander answered Oct 06 '22 19:10


@Alexander's answer is good. Just to clarify, this behavior is not specific to pandas; it is how Python itself works. See this related question:

Why is adding attributes to an already instantiated object allowed in Python?
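A minimal pure-Python sketch of the same behavior, with no pandas involved. The classes `Box` and `SlottedBox` are hypothetical and exist only for this illustration. The second class shows that the behavior is a default rather than a guarantee: a class that defines `__slots__` rejects new attributes.

```python
class Box(object):
    """Hypothetical class used only for illustration."""

    def __init__(self, value):
        self.value = value


box = Box(1)
box.note = 'added later'          # no error: stored in the instance __dict__
has_note = 'note' in vars(box)


class SlottedBox(object):
    """Hypothetical class that opts out of arbitrary attributes."""

    __slots__ = ('value',)

    def __init__(self, value):
        self.value = value


try:
    SlottedBox(1).note = 'nope'   # not listed in __slots__
    slotted_error = None
except AttributeError as exc:
    slotted_error = type(exc).__name__
```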

As for your last question, the Series is not connected to the DataFrame (though that depends on what you mean by connected). Consider this:

>>> df = pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6]})
>>> df.d = df.a + df.b
>>> df.sort_values("a", ascending=False, inplace=True)
>>> df
   a  b
2  3  6
1  2  5
0  1  4

>>> df.d
0    5
1    7
2    9
dtype: int64

So df.d has not been sorted, whereas df.a and df.b have.
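For contrast, here is a small sketch (my own variation on the example above, not the original answerer's code) that adds the same sum once as a real column and once as an attribute, then sorts in place. The names `c_index`, `d_index`, and `copy_has_d` are illustrative. The last line also checks whether the attribute survives `df.copy()`.

```python
import warnings

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df['c'] = df['a'] + df['b']           # real column

# Some pandas versions warn about attribute assignment; suppress for the demo.
with warnings.catch_warnings():
    warnings.simplefilter('ignore', UserWarning)
    df.d = df['a'] + df['b']          # plain attribute

df.sort_values('a', ascending=False, inplace=True)

c_index = df['c'].index.tolist()      # the column is reordered with its rows
d_index = df.d.index.tolist()         # the attribute keeps its original order

# A derived object does not carry the attribute along.
copy_has_d = hasattr(df.copy(), 'd')
```

The column follows every row operation on the DataFrame, while the attribute is left behind in whatever state it had when it was assigned.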

jrjc answered Oct 06 '22 20:10