I have a Pandas dataframe where I have designated some of the columns as indices:
planets_dataframe.set_index(['host','name'], inplace=True)
and would like to be able to refer to these indices in a variety of contexts. Using the name of an index works fine in queries
planets_dataframe.query('host == "PSR 1257 12"')
but results in an error if I try to use it to get a list of the values of an index, as I could when it was a column
planets_dataframe.name
#AttributeError: 'DataFrame' object has no attribute 'name'
or to use it to list results as I could when it was a "regular" column
planets_dataframe.query('30 > mass > 20 and discoveryyear > 2009')['name']
#KeyError: u'no item named name'
How do I refer to the "columns" of the dataframe that I'm using as indexes?
Before set_index:
planets_dataframe.columns
# Index([u'name', u'lastupdate', u'temperature', u'semimajoraxis', u'discoveryyear', u'calculated', u'period', u'age', u'mass', u'host', u'verification', u'transittime', u'eccentricity', u'radius', u'discoverymethod', u'inclination'], dtype='object')
After set_index:
planets_dataframe.columns
#Index([u'lastupdate', u'temperature', u'semimajoraxis', u'discoveryyear', u'calculated', u'period', u'age', u'mass', u'verification', u'transittime', u'eccentricity', u'radius', u'discoverymethod', u'inclination'], dtype='object')
The information is accessible using the index's get_level_values method:
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(10,4)), columns=list('ABCD'))
idf = df.set_index(list('AB'))
idf.index.get_level_values('A') is roughly equivalent to df['A']. Note the change in type and dtype, however:
print(df['A'])
# 0 1
# 1 3
# 2 3
# 3 0
# 4 2
# 5 2
# 6 3
# 7 1
# 8 3
# 9 3
# Name: A, dtype: int32
def level(df, lvl):
    return df.index.get_level_values(lvl)
print(level(idf, 'A'))
# Int64Index([1, 3, 3, 0, 2, 2, 3, 1, 3, 3], dtype='int64')
And here again, instead of selecting the column with ['A'], you can get the equivalent information using .index.get_level_values('A'):
print(df.query('3>C>0 and D>0')['A'])
# 8 3
# Name: A, dtype: int32
print(level(idf.query('3>C>0 and D>0'), 'A'))
# Int64Index([3], dtype='int64')
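Applied to the question's case, the same approach might look like the sketch below. The frame is a small hypothetical stand-in for planets_dataframe (the host names, planet names, and values are made up for illustration); only the column names and the query string come from the question.

```python
import pandas as pd

# Hypothetical stand-in for the question's planets_dataframe; all values are invented.
planets = pd.DataFrame({
    'host': ['Star A', 'Star A', 'Star B'],
    'name': ['A b', 'A c', 'B b'],
    'mass': [25.0, 5.0, 22.0],
    'discoveryyear': [2011, 2008, 2012],
}).set_index(['host', 'name'])

# Filter first, then read the 'name' level off the index of the result
# instead of indexing with ['name'], which would raise a KeyError.
selected = planets.query('30 > mass > 20 and discoveryyear > 2009')
names = selected.index.get_level_values('name').tolist()
print(names)
```

Calling .tolist() is optional; it is used here only to get a plain Python list rather than an Index object.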
PS. One of the golden rules of database design is, "Never repeat the same data in two places" since sooner or later the data will become inconsistent and thus corrupted. So I would recommend against keeping the data as both a column and an index, primarily because it could lead to data corruption, but also because it could be an inefficient use of memory.
I think you have a slight misunderstanding of what indexes are. You don't just "designate" columns as indexes; that is, you don't just "tag" certain columns with info that says "this is an index". The index is a separate data structure that can hold data that aren't even present in the columns. If you do set_index, you move those columns into the index, so they no longer exist as regular columns. This is why you can no longer use them in the ways you mention: they aren't there anymore.
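A minimal sketch of that move, using a made-up three-column frame (the names A, B, C are placeholders): after set_index, the key columns show up in index.names rather than in columns.

```python
import pandas as pd

# Toy frame; column names and values are illustrative only.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
idf = df.set_index(['A', 'B'])

remaining_columns = list(idf.columns)   # what is still a regular column
index_levels = list(idf.index.names)    # what now lives in the index

print(remaining_columns)
print(index_levels)
print('A' in idf.columns)
```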
One thing you can do is, when using set_index, pass drop=False to tell it to keep the columns as columns in addition to putting them in the index (effectively copying them to the index rather than moving them), e.g., df.set_index('SomeColumn', drop=False). However, you should be aware that the index and column are still distinct, so for instance if you modify the column values this will not affect what's stored in the index.
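To illustrate how the two copies can drift apart, here is a small sketch with a hypothetical two-row frame (the names key and val are invented for the example). The column is overwritten after set_index(..., drop=False), and the index keeps its original labels.

```python
import pandas as pd

# Toy frame; 'key' and 'val' are placeholder names.
df = pd.DataFrame({'key': ['x', 'y'], 'val': [1, 2]})
kept = df.set_index('key', drop=False)

# Overwrite the column; the index was built from the old values and is not updated.
kept['key'] = ['p', 'q']

index_values = kept.index.tolist()
column_values = kept['key'].tolist()
print(index_values)
print(column_values)
```

If the two need to agree again, one option is to call set_index once more on the modified column.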
The upshot is that indexes aren't really columns of the DataFrame, so if you want to be able to use some data as both an index and a column, you need to duplicate it in both places. There is some discussion of this issue here.