Consider the dataframe df
import pandas as pd
df = pd.DataFrame(dict(A=[1, 2, 3]))
df
A
0 1
1 2
2 3
Now I'll assign the series df.A to a variable a:
a = df.A
a
0 1
1 2
2 3
Name: A, dtype: int64
I'll now augment a's index:
a.index = a.index + 1
print(a)
print()
print(df)
1 1
2 2
3 3
Name: A, dtype: int64
A
0 1
1 2
2 3
Nothing to see here. Everything as expected...
But now I'm going to reassign a = df.A
a = df.A
print(a)
print()
print(df)
1 1
2 2
3 3
Name: A, dtype: int64
A
0 1
1 2
2 3
I just reassigned a directly from df. df's index is what it was, but a's index is different. It's what it was after I augmented it and before I reassigned it.
Of course, if I reconstruct df, everything is reset.
df = pd.DataFrame(dict(A=[1, 2, 3]))
a = df.A
print(a)
print()
print(df)
0 1
1 2
2 3
Name: A, dtype: int64
A
0 1
1 2
2 3
But that must mean that the pd.Series object being tracked inside the pd.DataFrame object keeps track of its own index, which isn't exactly visible at the pd.DataFrame level.
Question
Am I interpreting this correctly?
It even leads to weirdness like this:
pd.concat([df, df.A], axis=1)
A A
0 1.0 NaN
1 2.0 1.0
2 3.0 2.0
3 NaN 3.0
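The NaN rows in that output are ordinary index alignment: pd.concat with axis=1 takes the union of the two indexes. A minimal sketch, assuming a reasonably recent pandas, that reproduces the same shape with two objects whose indexes are built explicitly (the names left, right and out are only for illustration), so it doesn't depend on the caching behaviour at all:

```python
import pandas as pd

# A frame indexed 0..2 and a separate series indexed 1..3,
# mimicking df and the shifted df.A from the question.
left = pd.DataFrame({"A": [1, 2, 3]}, index=[0, 1, 2])
right = pd.Series([1, 2, 3], index=[1, 2, 3], name="A")

# axis=1 aligns on the row index (outer join by default),
# so labels missing on either side are filled with NaN.
out = pd.concat([left, right], axis=1)
print(out)
```

So the surprising part of the question isn't the concat itself, it's that df.A hands back a series whose index no longer matches df.index.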
This looks like either a bug or an unintended consequence of Python object identities. Prior to the assignment we can see that the index objects are the same:
In [175]:
df = pd.DataFrame(dict(A=[1, 2, 3]))
df
Out[175]:
A
0 1
1 2
2 3
In [176]:
print(id(df.index))
print(id(df['A']))
print(id(df['A'].index))
a = df.A
a
132848496
135123240
132848496
Out[176]:
0 1
1 2
2 3
Name: A, dtype: int64
Now if we modify our reference, a and df['A'] are still the same object, and they share the same new index object:
In [177]:
a.index = a.index + 1
print(a)
print(id(a))
print(id(df.A))
print()
print(df)
print(id(df.A.index))
print(id(a.index))
1 1
2 2
3 3
Name: A, dtype: int64
135123240
135123240
A
0 1
1 2
2 3
135125144
135125144
but now df.index is distinct from df['A'].index and a.index:
In [181]:
print(id(df.index))
print(id(a.index))
print(id(df['A'].index))
132848496
135124808
135124808
Personally I'd consider this an unintended consequence: once you take the reference a to column 'A', it's hard to say what the original df should do when you start to mutate that reference. I bet this is even harder to catch than the usual SettingWithCopyWarning.
To avoid this, it's best to call copy() to make a deep copy, so that any mutations don't affect the original df:
In [183]:
df = pd.DataFrame(dict(A=[1, 2, 3]))
a = df['A'].copy()
a.index = a.index+1
print(a)
print(df['A'])
print(df['A'].index)
print(df.index)
print()
print(id(df['A']))
print(id(a))
print(id(df['A'].index))
print(id(a.index))
1 1
2 2
3 3
Name: A, dtype: int64
0 1
1 2
2 3
Name: A, dtype: int64
RangeIndex(start=0, stop=3, step=1)
RangeIndex(start=0, stop=3, step=1)
135125984
135165376
135165544
135125816
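If you do this in more than one place, the copy-then-modify pattern above can be wrapped in a small function. This is only a sketch: shifted_copy is a made-up helper name, not something pandas provides, and it relies on Series.copy() being a deep copy by default.

```python
import pandas as pd


def shifted_copy(series: pd.Series, offset: int = 1) -> pd.Series:
    """Return an independent copy of `series` with its index shifted by `offset`.

    Hypothetical helper for illustration; not part of pandas.
    """
    out = series.copy()             # deep copy by default
    out.index = out.index + offset  # only the copy's index is rebound
    return out


df = pd.DataFrame(dict(A=[1, 2, 3]))
a = shifted_copy(df["A"])

print(list(a.index))        # index of the copy
print(list(df["A"].index))  # index of the column as seen through df
print(list(df.index))       # index of the frame itself
```

Since the index assignment only ever touches the copy, df['A'].index and df.index stay in step.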
It's a game of references (pointers). Each DataFrame has its own index array, and the series in the DataFrame hold references to that same index array.
When a.index = a.index + 1 is executed, the reference held by the series is changed, so a.index is the same as df.A.index, which is now different from df.index.
Now if you clear df's item cache (through a private method, so it may not be available in every pandas version), the series is reset:
print(df.A.index)
df._clear_item_cache()
print(df.A.index)
By default, series indexes inside the DataFrame are immutable, but holding a reference to the series gives a workaround for changing which index it points to.
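To make the reference game concrete, here is a toy model in plain Python. It is not pandas's actual implementation: ToySeries, ToyFrame and clear_item_cache are invented names, and the only assumption is the one described above, namely that the frame caches the column object it hands out and that object holds its own reference to an index.

```python
class ToySeries:
    """Stand-in for a Series: holds a reference to an index object."""

    def __init__(self, values, index):
        self.values = values
        self.index = index


class ToyFrame:
    """Stand-in for a DataFrame that caches the column objects it returns."""

    def __init__(self, columns, index):
        self._columns = columns
        self.index = index
        self._item_cache = {}

    def __getitem__(self, name):
        # Build the column object on first access only, then reuse it.
        if name not in self._item_cache:
            self._item_cache[name] = ToySeries(self._columns[name], self.index)
        return self._item_cache[name]

    def clear_item_cache(self):
        # Forget cached columns; the next access rebuilds from frame state.
        self._item_cache.clear()


frame = ToyFrame({"A": [1, 2, 3]}, index=[0, 1, 2])
a = frame["A"]
a.index = [i + 1 for i in a.index]  # rebinds the cached object's index

same_object = frame["A"] is a       # the cache returns the very same object
stale_index = frame["A"].index      # so it carries the shifted index
frame_index = frame.index           # the frame's own index is untouched

frame.clear_item_cache()
fresh_index = frame["A"].index      # rebuilt from the frame's index

print(same_object, stale_index, frame_index, fresh_index)
# True [1, 2, 3] [0, 1, 2] [0, 1, 2]
```

The toy frame shows the same three symptoms as the question: reassigning from the frame returns the modified object, the frame's own index never changes, and clearing the cache (or rebuilding the frame) brings the column back in line.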