I am looking to sort a dataframe. I have this dataframe:
Y X1 X2 X3
Y1 1 0 1
Y2 1 0 0
Y3 1 0 0
Y4 0 1 1
There are a lot of columns. I want to select the X columns with the largest sums, where each sum is the total obtained by adding down that column.
I have been trying to do this by adding a row like so:
Y X1 X2 X3
Y1 1 0 1
Y2 1 0 0
Y3 1 0 0
Y4 0 1 1
sum 3 1 2
and then I would sort it by the sum row
Y X1 X3 X2
Y1 1 1 0
Y2 1 0 0
Y3 1 0 0
Y4 0 1 1
sum 3 2 1
and select 30 columns to use. However, I can only get a sum of the rows like so:
Y X1 X3 X2 sum
Y1 1 1 0 2
Y2 1 0 0 1
Y3 1 0 0 1
Y4 0 1 1 2
using
pivot_table['sum'] = pivot_table.sum(axis=1)
I also tried
pivot_table['sum'] = pivot_table.sum(axis=0)
and attempted to add .transpose() but this isn't working. I also think there is probably a faster way to do this than the step-by-step attempt I am making.
You can call sum on the df, which returns a Series of column totals. You can then sort this Series and use its index to reorder the columns of your df:
In [249]:
# note that column 'X3' will produce a sum value of 2
t="""Y X1 X2 X3
Y1 1 0 1
Y2 1 0 1
Y3 1 0 0
Y4 0 1 0"""
# load the data (needs io and pandas)
import io
import pandas as pd
df = pd.read_csv(io.StringIO(t), sep=r'\s+', index_col=[0])
df
Out[249]:
X1 X2 X3
Y
Y1 1 0 1
Y2 1 0 1
Y3 1 0 0
Y4 0 1 0
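The result from sum is a Series. We want to sort it with ascending=False so the largest totals come first. Use sort_values, which returns a sorted copy (the older Series.sort(ascending=False, inplace=False) spelling has been removed from pandas):
In [250]:
# now calculate the sum, then sort the series in descending order
s = df.sum().sort_values(ascending=False)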
s
Out[250]:
X1 3
X3 2
X2 1
dtype: int64
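As a side note on why the attempts in the question did not produce a row of column totals, here is a minimal sketch, assuming the same small frame as above. The names col_sums, row_sums, attempt and with_total are illustrative only and do not appear in the question.

```python
import io
import pandas as pd

t = """Y X1 X2 X3
Y1 1 0 1
Y2 1 0 1
Y3 1 0 0
Y4 0 1 0"""
df = pd.read_csv(io.StringIO(t), sep=r"\s+", index_col=0)

col_sums = df.sum(axis=0)  # one total per column, labelled X1, X2, X3
row_sums = df.sum(axis=1)  # one total per row, labelled Y1..Y4

# Assigning the column totals as a new *column* aligns them on the row
# labels (Y1..Y4). None of those match X1..X3, so every entry is NaN.
attempt = df.copy()
attempt["sum"] = col_sums

# Adding the totals as a new *row* works, because the Series labels
# match the column labels of the frame.
with_total = df.copy()
with_total.loc["sum"] = col_sums
```

So axis=0 does give the totals you were after; they just need to be added as a row (or, as in this answer, used directly for sorting) rather than as a column.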
In [251]:
# now use label-based indexing to reorder the columns of the df (.ix has been removed from pandas)
df.loc[:, s.index]
Out[251]:
X1 X3 X2
Y
Y1 1 1 0
Y2 1 1 0
Y3 1 0 0
Y4 0 0 1
You can slice the index if you want just the top n columns:
In [254]:
df = df[s.index[:2]]
df
Out[254]:
X1 X3
Y
Y1 1 1
Y2 1 1
Y3 1 0
Y4 0 0
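If you want this as a single step, the sort and slice can probably be folded together. The following is only a sketch: top_columns_by_sum is a hypothetical helper name, and it swaps the sort-then-slice above for Series.nlargest, which returns the n largest values in descending order.

```python
import pandas as pd


def top_columns_by_sum(frame: pd.DataFrame, n: int) -> pd.DataFrame:
    """Return the n columns with the largest column sums, largest first.

    Hypothetical helper, not part of pandas.
    """
    top = frame.sum().nlargest(n).index  # labels of the n largest totals
    return frame[top]


# Same data as in the example above, built directly for self-containment.
df = pd.DataFrame(
    {"X1": [1, 1, 1, 0], "X2": [0, 0, 0, 1], "X3": [1, 1, 0, 0]},
    index=pd.Index(["Y1", "Y2", "Y3", "Y4"], name="Y"),
)

result = top_columns_by_sum(df, 2)  # keeps X1 and X3, in that order
```

For your case you would pass n=30. If several columns tie for the last place, nlargest keeps the ones that appear first by default.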