For each row return the column name of the largest value

Tags:

r

One option using your data (for future reference, use set.seed() to make examples using sample reproducible):

DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))

colnames(DF)[apply(DF,1,which.max)]
[1] "V3" "V1" "V2"

A faster solution than using apply might be max.col:

colnames(DF)[max.col(DF,ties.method="first")]
#[1] "V3" "V1" "V2"

...where ties.method can be "random" (the default), "first", or "last".
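As a small illustration of how ties.method changes the answer, here is a sketch using a hypothetical data frame (DF_tie, not part of the question's data) where the first row has a tie between V2 and V3:

```r
# Row 1 has a tie between V2 and V3 (both equal 7).
DF_tie <- data.frame(V1 = c(2, 8, 1), V2 = c(7, 3, 5), V3 = c(7, 6, 4))

# "first" picks the leftmost tied column, "last" the rightmost.
first_max <- colnames(DF_tie)[max.col(DF_tie, ties.method = "first")]
last_max  <- colnames(DF_tie)[max.col(DF_tie, ties.method = "last")]

first_max  # "V2" "V1" "V2"
last_max   # "V3" "V1" "V2"
```

With the default "random", the result for row 1 can differ between runs unless a seed is set.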

This of course causes issues if you happen to have two columns which are equal to the maximum. I'm not sure what you want to do in that instance as you will have more than one result for some rows. E.g.:

DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(7,6,4))
apply(DF,1,function(x) which(x==max(x)))

[[1]]
V2 V3 
 2  3 

[[2]]
V1 
 1 

[[3]]
V2 
 2
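If a list is awkward to work with, one possible way to keep every tied column is to collapse the names into a single string per row. This is only a sketch; the comma separator and the name all_max are arbitrary choices:

```r
DF <- data.frame(V1 = c(2, 8, 1), V2 = c(7, 3, 5), V3 = c(7, 6, 4))

# For each row, collect every column name that attains the row maximum
# and paste them together into one comma-separated string.
all_max <- apply(DF, 1, function(x) paste(names(x)[x == max(x)], collapse = ","))

all_max  # "V2,V3" "V1" "V2"
```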

If you're interested in a data.table solution, here's one. It's a bit tricky because you want the id of the first maximum; it would be easier if the last maximum were acceptable. Still, it isn't complicated, and it's fast!

Here I've generated data of your dimensions (26746 * 18).

Data

set.seed(45)
DF <- data.frame(matrix(sample(10, 26746*18, TRUE), ncol=18))

`data.table` answer:

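library(data.table)
# Long format, one row per cell. Note the naming: colid holds the original
# row number (recycled across columns) and rowid holds the column name.
DT <- data.table(value=unlist(DF, use.names=FALSE), 
            colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
# Inner join finds each row's maximum; outer join takes its first column.
t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]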

Benchmarking:

# data.table solution
system.time({
DT <- data.table(value=unlist(DF, use.names=FALSE), 
            colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]
})
#   user  system elapsed 
#  0.174   0.029   0.227 

# apply solution from @thelatemail
system.time(t2 <- colnames(DF)[apply(DF,1,which.max)])
#   user  system elapsed 
#  2.322   0.036   2.602 

identical(t1, t2)
# [1] TRUE

It's about 11 times faster on data of these dimensions, and data.table scales pretty well too.
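For comparison, the base R max.col approach mentioned earlier can be checked against apply on data of the same dimensions. This is a sketch rather than a benchmark; the names via_apply and via_max_col are just for illustration, and timings will vary by machine:

```r
set.seed(45)
DF <- data.frame(matrix(sample(10, 26746 * 18, TRUE), ncol = 18))

# Row-by-row search with apply (returns the first maximum in each row)
via_apply <- colnames(DF)[apply(DF, 1, which.max)]

# Vectorised search with max.col, also taking the first maximum on ties
via_max_col <- colnames(DF)[max.col(DF, ties.method = "first")]

# Both approaches should agree on integer-valued data like this
identical(via_apply, via_max_col)
```

Wrapping either line in system.time() gives a rough idea of the relative speed on your own machine.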

Edit: if any of the tied maximum columns is acceptable, then:

DT <- data.table(value=unlist(DF, use.names=FALSE), 
            colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid)), rowid, mult="last"]

One solution could be to reshape the data from wide to long, putting all the departments in one column and the counts in another, then group by the employer id (in this case, the row number) and filter to the department(s) with the max value. There are a couple of options for handling ties with this approach too.

library(tidyverse)

# sample data frame with a tie
df <- tibble(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,5))  # tibble() replaces the deprecated data_frame()

# If you aren't worried about ties:  
df %>% 
  rownames_to_column('id') %>%  # creates an ID number
  gather(dept, cnt, V1:V3) %>% 
  group_by(id) %>% 
  slice(which.max(cnt)) 

# A tibble: 3 x 3
# Groups:   id [3]
  id    dept    cnt
  <chr> <chr> <dbl>
1 1     V3       9.
2 2     V1       8.
3 3     V2       5.


# If you're worried about keeping ties:
df %>% 
  rownames_to_column('id') %>%
  gather(dept, cnt, V1:V3) %>% 
  group_by(id) %>% 
  filter(cnt == max(cnt)) %>% # top_n(cnt, n = 1) also works
  arrange(id)

# A tibble: 4 x 3
# Groups:   id [3]
  id    dept    cnt
  <chr> <chr> <dbl>
1 1     V3       9.
2 2     V1       8.
3 3     V2       5.
4 3     V3       5.


# If you're worried about ties, but only want a certain department, you could use rank() and choose 'first' or 'last'
df %>% 
  rownames_to_column('id') %>%
  gather(dept, cnt, V1:V3) %>% 
  group_by(id) %>% 
  mutate(dept_rank  = rank(-cnt, ties.method = "first")) %>% # or 'last'
  filter(dept_rank == 1) %>% 
  select(-dept_rank) 

# A tibble: 3 x 3
# Groups:   id [3]
  id    dept    cnt
  <chr> <chr> <dbl>
1 2     V1       8.
2 3     V2       5.
3 1     V3       9.

# if you wanted to keep the original wide data frame
df %>% 
  rownames_to_column('id') %>%
  left_join(
    df %>% 
      rownames_to_column('id') %>%
      gather(max_dept, max_cnt, V1:V3) %>% 
      group_by(id) %>% 
      slice(which.max(max_cnt)), 
    by = 'id'
  )

# A tibble: 3 x 6
  id       V1    V2    V3 max_dept max_cnt
  <chr> <dbl> <dbl> <dbl> <chr>      <dbl>
1 1        2.    7.    9. V3            9.
2 2        8.    3.    6. V1            8.
3 3        1.    5.    5. V2            5.
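In more recent versions of tidyr and dplyr, the same reshape-and-filter idea can be written with pivot_longer() and slice_max(). The sketch below assumes dplyr 1.0.0 or later; the id column is created with row_number() instead of rownames_to_column(), and the object names long, with_ties and one_per_id are illustrative:

```r
library(dplyr)
library(tidyr)

df <- tibble(V1 = c(2, 8, 1), V2 = c(7, 3, 5), V3 = c(9, 6, 5))

# Wide to long: one row per (id, dept) pair
long <- df %>%
  mutate(id = row_number()) %>%
  pivot_longer(cols = V1:V3, names_to = "dept", values_to = "cnt") %>%
  group_by(id)

# Keep every tied department (row 3 has V2 and V3 tied at 5)
with_ties <- long %>%
  slice_max(cnt, n = 1, with_ties = TRUE) %>%
  ungroup()

# Keep exactly one department per id, even when there is a tie
one_per_id <- long %>%
  slice_max(cnt, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Here with_ties has four rows and one_per_id has three, mirroring the filter() and slice(which.max()) versions above.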

Based on the above suggestions, the following data.table solution worked very fast for me:

library(data.table)

set.seed(45)
DT <- data.table(matrix(sample(10, 10^7, TRUE), ncol=10))

system.time(
  DT[, col_max := colnames(.SD)[max.col(.SD, ties.method = "first")]]
)
#>    user  system elapsed 
#>    0.15    0.06    0.21
DT[]
#>          V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 col_max
#>       1:  7  4  1  2  3  7  6  6  6   1      V1
#>       2:  4  6  9 10  6  2  7  7  1   3      V4
#>       3:  3  4  9  8  9  9  8  8  6   7      V3
#>       4:  4  8  8  9  7  5  9  2  7   1      V4
#>       5:  4  3  9 10  2  7  9  6  6   9      V4
#>      ---                                       
#>  999996:  4  6 10  5  4  7  3  8  2   8      V3
#>  999997:  8  7  6  6  3 10  2  3 10   1      V6
#>  999998:  2  3  2  7  4  7  5  2  7   3      V4
#>  999999:  8 10  3  2  3  4  5  1  1   4      V2
#> 1000000: 10  4  2  6  6  2  8  4  7   4      V1

This approach also has the advantage that you can always specify which columns .SD should consider by listing them in .SDcols:

DT[, MAX2 := colnames(.SD)[max.col(.SD, ties.method="first")], .SDcols = c("V9", "V10")]

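In case we need the column name of the smallest value, as suggested by @lwshang, one just needs to use -.SD (restricting .SD to the numeric columns with .SDcols, because negating the character columns added above would fail):

DT[, col_min := colnames(.SD)[max.col(-.SD, ties.method = "first")], .SDcols = paste0("V", 1:10)]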
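The same negation trick works in base R without data.table. A minimal sketch on the small example data frame, with which.min shown as a cross-check (the variable names are illustrative):

```r
DF <- data.frame(V1 = c(2, 8, 1), V2 = c(7, 3, 5), V3 = c(9, 6, 4))

# Negating the values turns the row minimum into the row maximum
min_via_max_col <- colnames(DF)[max.col(-DF, ties.method = "first")]

# Equivalent row-by-row version using which.min
min_via_apply <- colnames(DF)[apply(DF, 1, which.min)]

min_via_max_col  # "V1" "V2" "V1"
```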
