
Do the individual Series contained within a DataFrame maintain their own index?

Tags: python, pandas

Consider the dataframe df

import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3]))
df

   A
0  1
1  2
2  3

Now I'll assign the series df.A to a variable a

a = df.A
a

0    1
1    2
2    3
Name: A, dtype: int64

I'll now shift a's index up by 1

a.index = a.index + 1
print(a)
print()
print(df)

1    1
2    2
3    3
Name: A, dtype: int64

   A
0  1
1  2
2  3

Nothing to see here. Everything as expected...
But now I'm going to reassign a = df.A

a = df.A
print(a)
print()
print(df)

1    1
2    2
3    3
Name: A, dtype: int64

   A
0  1
1  2
2  3

I just reassigned a directly from df. df's index is what it was, but a's index is different. It's what it was after I augmented it and before I reassigned it.

Of course, if I reconstruct df, everything is reset.

df = pd.DataFrame(dict(A=[1, 2, 3]))
a = df.A
print(a)
print()
print(df)

0    1
1    2
2    3
Name: A, dtype: int64

   A
0  1
1  2
2  3

But that must mean that the pd.Series object tracked inside the pd.DataFrame object keeps track of its own index, which isn't exactly visible at the pd.DataFrame level.

Question
Am I interpreting this correctly?

It even leads to weirdness like this:

pd.concat([df, df.A], axis=1)

     A    A
0  1.0  NaN
1  2.0  1.0
2  3.0  2.0
3  NaN  3.0
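
Whether this reproduces may depend on the pandas version, since the output above dates from 2017. The following is a minimal sketch that only reports what the installed version does, without assuming either outcome. The names same_object and column_shifted are illustrative and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3]))
a = df.A
a.index = a.index + 1  # rebinds the index on the Series object held by `a`

# True means df.A handed back the very same Series object as `a`.
same_object = df.A is a
# True means the column fetched from df now reports the shifted labels.
column_shifted = list(df.A.index) == [1, 2, 3]

print("pandas", pd.__version__)
print("df.A is a:", same_object)
print("df.A shows shifted index:", column_shifted)
print("df.index:", list(df.index))
```

If both flags print True, the installed version behaves as described in the question. If they print False, df.A is returning a fresh Series on each access.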
piRSquared asked Apr 21 '17 14:04




2 Answers

This looks like either a bug or an unintended consequence of Python object identity. Prior to the assignment, we can see that the indices are the same object:

In [175]:
df = pd.DataFrame(dict(A=[1, 2, 3]))
df

Out[175]:
   A
0  1
1  2
2  3

In [176]:
print(id(df.index))
print(id(df['A']))
print(id(df['A'].index))
a = df.A
a

132848496
135123240
132848496

Out[176]:
0    1
1    2
2    3
Name: A, dtype: int64

Now if we modify the index through our reference, a and df['A'] are still the same object, and they share a new index object:

In [177]:
a.index = a.index + 1
print(a)
print(id(a))
print(id(df.A))
print()
print(df)
print(id(df.A.index))
print(id(a.index))

1    1
2    2
3    3
Name: A, dtype: int64
135123240
135123240

   A
0  1
1  2
2  3
135125144
135125144

but now df.index is distinct from df['A'].index and a.index:

In [181]:
print(id(df.index))
print(id(a.index))
print(id(df['A'].index))

132848496
135124808
135124808
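
A note on the id() comparisons: CPython may reuse an id once a temporary object has been freed, so comparing id() values of expressions such as df['A'] that are not kept alive can mislead. A sketch of the same check that holds references and uses `is` instead. The variable names are illustrative, and the first printed result is deliberately not assumed because it may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3]))

# Keep every object alive in a named variable before comparing identities.
frame_index = df.index
col = df['A']
col_index = col.index

print("column shares the frame's index object:", col_index is frame_index)

# Index arithmetic builds a new Index object, which is then bound to `col`.
col.index = col.index + 1
new_index = col.index

print("shifted index is the frame's index object:", new_index is frame_index)
print("frame index labels:", list(frame_index))
```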

Personally, I'd consider this an unintended consequence: once you take the reference a to column 'A', it's unclear what the original df should do when you start to mutate that reference. I bet this is even harder to catch than the usual SettingWithCopyWarning.

To avoid this, it's best to call copy() to make a deep copy, so that any mutations don't affect the original df:

In [183]:
df = pd.DataFrame(dict(A=[1, 2, 3]))
a = df['A'].copy()
a.index = a.index+1
print(a)
print(df['A'])
print(df['A'].index)
print(df.index)
print()
print(id(df['A']))
print(id(a))
print(id(df['A'].index))
print(id(a.index))

1    1
2    2
3    3
Name: A, dtype: int64
0    1
1    2
2    3
Name: A, dtype: int64
RangeIndex(start=0, stop=3, step=1)
RangeIndex(start=0, stop=3, step=1)

135125984
135165376
135165544
135125816
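
As a follow-up check, here is a short sketch showing that once the index change is made on a copy, the concat from the question lines up row for row again. The names shifted and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3]))

shifted = df['A'].copy()            # independent Series object
shifted.index = shifted.index + 1   # only the copy is relabelled

# The frame's own column keeps its original labels...
print(list(df['A'].index))

# ...so concatenating the frame with its own column aligns on 0, 1, 2
# and introduces no NaN rows.
result = pd.concat([df, df['A']], axis=1)
print(result)
```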
EdChum answered Oct 13 '22 01:10


This is a matter of references (pointers). Each DataFrame has its own index array, and the Series in the DataFrame hold references to that same index array.

When a.index = a.index + 1 is executed, the index reference held by the Series is replaced. Because a and df.A are the same Series object, a.index is the same as df.A.index, which now differs from df.index.

If you then clear df's item cache, the Series is reset:

print(df.A.index)
df._clear_item_cache()
print(df.A.index)

By default, the indexes of the Series inside a DataFrame are immutable, but taking a reference to the Series provides a workaround for replacing its index reference.
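
A self-contained version of the cache-clearing experiment, as a sketch only. _clear_item_cache is a private method (note the leading underscore), so it is not part of the public API and is assumed here to possibly be absent in some pandas versions. The lookup is therefore guarded, and the helper name clear_cache is illustrative.

```python
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3]))
a = df.A
a.index = a.index + 1

print("before:", list(df.A.index))

# Private method taken from the snippet above; look it up defensively
# because private internals can change or disappear between releases.
clear_cache = getattr(df, "_clear_item_cache", None)
if callable(clear_cache):
    clear_cache()
else:
    print("this pandas version has no DataFrame._clear_item_cache")

print("after: ", list(df.A.index))
# The variable `a` still refers to the relabelled Series either way.
print("a still holds:", list(a.index))
```

For real code, copying the column before changing its index is the safer route, since it does not rely on private internals.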

lsalamon answered Oct 13 '22 01:10