Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sorting and Re-arranging Pandas DataFrame Using For Loop

I have a below data frame

df = pd.DataFrame([['NY','R',1],
                  ['NJ','Y',12],
                  ['FL','B',20],
                  ['CA','B',40],
                  ['AZ','Y',51],
                  ['NY','R',2],
                  ['NJ','Y',18],
                  ['FL','B',30],
                  ['CA','B',20],
                  ['AZ','Y',45],
                  ['NY','Y',3],
                  ['NJ','R',15],
                  ['FL','R',10],
                  ['CA','R',70],
                  ['AZ','B',25],
                  ['NY','B',4],
                  ['NJ','B',17],
                  ['FL','Y',30],
                  ['CA','R',30],
                  ['AZ','B',75],
                  ['FL','R',5],
                  ['FL','Y',25],
                  ['NJ','R',14],
                  ['NJ','B',11],
                  ['NY','B',5],
                  ['NY','Y',7]],
                    columns = ['State', 'ID','data'])


State ID  data
0     NY  R     1
1     NJ  Y    12
2     FL  B    20
3     CA  B    40
4     AZ  Y    51
5     NY  R     2
6     NJ  Y    18
7     FL  B    30
8     CA  B    20
9     AZ  Y    45
10    NY  Y     3
11    NJ  R    15
12    FL  R    10
13    CA  R    70
14    AZ  B    25
15    NY  B     4
16    NJ  B    17
17    FL  Y    30
18    CA  R    30
19    AZ  B    75
20    FL  R     5
21    FL  Y    25
22    NJ  R    14
23    NJ  B    11
24    NY  B     5
25    NY  Y     7

What I want to do: to re-create a new data frame such that it only contains the smallest number from eachID of the state. For example: for State: NY and ID: R, there are 2 data: 1 and 2. The new dataframe will only take value: 1 for category State: NY and ID: R. The new data frame shall preferably look like this:

  State  dataR  dataB  dataY
0    NY    1.0      4    3.0
1    NJ   14.0     11   12.0
2    FL    5.0     20   25.0
3    CA   30.0     20    NaN
4    AZ    NaN     25   45.0

Note that: State AZ and CA does not have any value (NaN) for column dataR and dataY, respectively, in the result because they originally have no such value in the original data frame. Please also note that the column in the result becomes dataR, dataB and dataY. I aim to create these columns in the result such that one can easily read the result later on in the actual data.

AND: I also want to be flexible such that I can seek for minimum value of each State in data per each ID R&Y together and B, so the new data frame will look like:

  State  dataRY  dataB
0    NY       1      4
1    NJ      12     11
2    FL       5     20
3    CA      30     20
4    AZ      45     25

I tried using for loops as below:

colours = [['R'],['B'],['Y']]

def rearranging(df):
    df_result = []
    for c in colours:
        df_colours          = df[df['ID'].isin(c)]
        df_colours_result   = []
        for state in np.unique(df['State'].values):
            df1     = df_colours[df_colours['State'] == state]
            df2     = df1.nsmallest(1,'data')
            df_colours_result.append(df2)
        first_loop_result = pd.concat(df_colours_result,ignore_index = True, sort = False)
        df_result.append(first_loop_result)
    final_result = pd.concat(df_result, axis = 1)
    return final_result

The variable colours should be there because I want to be flexible and such that I can change their values if the data source changes later in the time.

And the result of the above for loop is:

  State   ID  data State ID  data State   ID  data
0    CA    R  30.0    AZ  B    25    AZ    Y  45.0
1    FL    R   5.0    CA  B    20    FL    Y  25.0
2    NJ    R  14.0    FL  B    20    NJ    Y  12.0
3    NY    R   1.0    NJ  B    11    NY    Y   3.0
4   NaN  NaN   NaN    NY  B     4   NaN  NaN   NaN

I don't like my result because: it is hard to read and I need to re-arrange and rename the columns again. Is there anyway to get the result that I actually aim at the above by using for loops? Vectorization is also welcome.

Please also be informed (once again) that I also want to be flexible on the column ID. This is the reason for me to include for example I want to say I need to see the smallest value of data for each State for ID R&Y combined and ID B. In my attempt I simply alter the code as below with the loop stays the same:

colours = [['R','Y'],['B']]


And the result is:

  State ID  data State ID  data
0    AZ  Y    45    AZ  B    25
1    CA  R    30    CA  B    20
2    FL  R     5    FL  B    20
3    NJ  Y    12    NJ  B    11
4    NY  R     1    NY  B     4

Note: in comparison and if there is NaN, then NaN is simply ignored (and not treated as Zero).

Once again the result is not the same as what I aim for and this table is not informative enough.

like image 813
teteh May Avatar asked May 20 '26 12:05

teteh May


1 Answers

IIUC, use groupby() on State and ID and get min of data column, and unstack(add_prefix) if required.:

df.groupby(['State','ID'],sort=False)['data'].min().unstack().add_prefix('data_')

ID     data_R  data_Y  data_B
State                        
NY        1.0     3.0     4.0
NJ       14.0    12.0    11.0
FL        5.0    25.0    20.0
CA       30.0     NaN    20.0
AZ        NaN    45.0    25.0

EDIT: As requested by OP , if you want to merge Y and R together, just replace and do similar:

(df.assign(ID=df['ID'].replace(['Y','R'],'YR'))
    .groupby(['State','ID'],sort=False)['data'].min().unstack().add_prefix('data_'))

ID     data_YR  data_B
State                 
NY           1       4
NJ          12      11
FL           5      20
CA          30      20
AZ          45      25
like image 102
anky Avatar answered May 22 '26 01:05

anky



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!