Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get the max of all the named categories in a pandas table

So imagine I have data like this:

  name  time
0    A     1
1    A     2
2    B     3
3    A     6
4    A     7
5    A     3
6    B     1
7    B     4

Each entry has a named category and some other information. In the example above lets take time. Its the only one i care about.

I would like to produce a table which has just the individual unique name categopries and the max of each. I can do something like this:

max_table = pd.DataFrame(
    {
        "name": data.name.unique(),
        "max_val": [
            data[data["name"] == name].time.max() for name in data.name.unique()
        ],
    }
)

But this does not feel very pandas-y. I have to go to and from the pandas table between lists and do some array expansion to make this work. Is there a way to do this with just a pandas type call?

Full example including data creation:

    data = pd.DataFrame(
        {
            "name": pd.Categorical(["A", "A", "B", "A", "A", "A", "B", "B"]),
            "time": [1, 2, 3, 6, 7, 3, 1, 4],
        }
    )
    print(data)
    print("======================")

    max_table = pd.DataFrame(
        {
            "name": data.name.unique(),
            "max_val": [
                data[data["name"] == name].time.max() for name in data.name.unique()
            ],
        }
    )
    print(max_table)
like image 538
Fantastic Mr Fox Avatar asked Nov 24 '25 17:11

Fantastic Mr Fox


1 Answers

The first thing to get to know about Pandas or Numpy is Vectorized manipulation(like Matlab dealing data with matrix/arrays). Which can process multiple items of data at once instead of for loop operation.




groupby() clustering data into groups, and max() find the maximum value of each group.

output = df.groupby('name').max()
output
###
      time
name      
A        7
B        4

The remaining part is just to reconstruct the table(DataFrame) structure.

(As you can see, name is lower than time, which means name is set to be index of the table(DataFrame), and time is a column's name of the table)

output = output.reset_index().rename(columns={'time':'max_time'})
output
###
  name  max_time
0    A         7
1    B         4
like image 199
Baron Legendre Avatar answered Nov 26 '25 05:11

Baron Legendre



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!