Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sorting a pandas DataFrame by the order of a list

So I have a pandas DataFrame, df, with columns that represent taxonomical classification (i.e. Kingdom, Phylum, Class etc...) I also have a list of taxonomic labels that correspond to the order I would like the DataFrame to be ordered by.

The list looks something like this:

class_list=['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes', 'Clostridia', 'Bacilli', 'Actinobacteria', 'Betaproteobacteria', 'delta/epsilon subdivisions', 'Synergistia', 'Mollicutes', 'Nitrospira', 'Spirochaetia', 'Thermotogae', 'Aquificae', 'Fimbriimonas', 'Gemmatimonadetes', 'Dehalococcoidia', 'Oscillatoriophycideae', 'Chlamydiae', 'Nostocales', 'Thermodesulfobacteria', 'Erysipelotrichia', 'Chlorobi', 'Deinococci']

This list would correspond to the Dataframe column df['Class']. I would like to sort all the rows for the whole dataframe based on the order of the list as df['Class'] is in a different order currently. What would be the best way to do this?

like image 270
Wes Field Avatar asked Oct 05 '14 13:10

Wes Field


2 Answers

You could make the Class column your index column

df = df.set_index('Class')

and then use df.loc to reindex the DataFrame with class_list:

df.loc[class_list]

Minimal example:

>>> df = pd.DataFrame({'Class': ['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes'], 'Number': [3, 5, 6]})
>>> df
                 Class  Number
0  Gammaproteobacteria       3
1        Bacteroidetes       5
2        Negativicutes       6

>>> df = df.set_index('Class')
>>> df.loc[['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']]
                     Number
Bacteroidetes             5
Negativicutes             6
Gammaproteobacteria       3
like image 90
Alex Riley Avatar answered Sep 29 '22 15:09

Alex Riley


Alex's solution doesn't work if your original dataframe does not contain all of the elements in the ordered list i.e.: if your input data at some point in time does not contain "Negativicutes", this script will fail. One way to get past this is to append your df's in a list and concatenate them at the end. For example:

ordered_classes = ['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']

df_list = []

for i in ordered_classes:
   df_list.append(df[df['Class']==i])

ordered_df = pd.concat(df_list)
like image 23
jarvis Avatar answered Sep 29 '22 16:09

jarvis