
Find the most frequent value by row

My problem is as follows:

I have a data set containing several factor variables that all share the same categories. For each row, I need to find the category that occurs most frequently. In case of ties an arbitrary value can be chosen, although it would be great to have more control over which one is picked.

My data set contains over a hundred factors. However, the structure is something like this:

df = data.frame(id = 1:3,
                var1 = c("red","yellow","green"),
                var2 = c("red","yellow","green"),
                var3 = c("yellow","orange","green"),
                var4 = c("orange","green","yellow"))

df
#   id   var1   var2   var3   var4
# 1  1    red    red yellow orange
# 2  2 yellow yellow orange  green
# 3  3  green  green  green yellow

The solution should be a variable within the data frame, for example var5, which contains the most frequent category for each row. It can be a factor or a numeric vector (in case the data need to be converted to numeric vectors first).

In this case, I would like to have this solution:

df$var5
# [1] "red"    "yellow" "green" 

Any advice will be much appreciated! Thanks in advance!

ZMacarozzi asked Nov 14 '13 16:11



1 Answer

Something like this, dropping the id column so that ids are never counted as categories:

apply(df[-1], 1, function(x) names(which.max(table(x))))
[1] "red"    "yellow" "green" 
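Since the question asks for the result as a column of the data frame, here is a minimal sketch of storing it as var5. It assumes the category columns can be recognised by the "var" prefix used in the example; with real column names the selection step would need adjusting.

```r
# Rebuild the example data from the question.
df <- data.frame(id = 1:3,
                 var1 = c("red", "yellow", "green"),
                 var2 = c("red", "yellow", "green"),
                 var3 = c("yellow", "orange", "green"),
                 var4 = c("orange", "green", "yellow"))

# Assumption: every category column starts with "var"; id is left out
# so that it is not tabulated together with the colours.
vars <- grep("^var", names(df), value = TRUE)

# For each row, tabulate the values and keep the name of the largest count.
df$var5 <- apply(df[vars], 1, function(x) names(which.max(table(x))))

df$var5
# [1] "red"    "yellow" "green"
```

The result is a character vector; wrap it in factor() if a factor column is preferred.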

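In case of a tie, which.max takes the first maximum. Because table() sorts the categories of a character vector alphabetically, the tied category that sorts first is the one returned. From the which.max help page: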

Determines the location, i.e., index of the (first) minimum or maximum of a numeric vector.

Example:

# Change var4 so that row 1 has a tie between "red" and "yellow"
df$var4 <- c("yellow","green","yellow")

> df
  id   var1   var2   var3   var4
1  1    red    red yellow yellow
2  2 yellow yellow orange  green
3  3  green  green  green yellow

apply(df[-1], 1, function(x) names(which.max(table(x))))
[1] "red"    "yellow" "green" 
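For more control over ties, as the question asks, one option is to tabulate each row as a factor whose levels are listed in order of preference; which.max then returns the preferred category among the tied ones. This is only a sketch: the priority vector and the most_frequent helper are illustrative names, not part of the original answer.

```r
# Example data with a tie in row 1 (two "red", two "yellow").
df <- data.frame(id = 1:3,
                 var1 = c("red", "yellow", "green"),
                 var2 = c("red", "yellow", "green"),
                 var3 = c("yellow", "orange", "green"),
                 var4 = c("yellow", "green", "yellow"))

# Hypothetical preference order: earlier entries win ties.
priority <- c("yellow", "red", "green", "orange")

# Count the values of one row in the order given by `priority`;
# which.max returns the first maximum, i.e. the preferred category.
most_frequent <- function(x, priority) {
  counts <- table(factor(x, levels = priority))
  names(which.max(counts))
}

# Drop the id column, then apply the helper to every row.
df$var5 <- apply(df[-1], 1, most_frequent, priority = priority)

df$var5
# [1] "yellow" "yellow" "green"
```

Row 1 now resolves to "yellow" instead of "red" because "yellow" comes first in priority. Values that do not appear in priority are ignored by factor(), so the vector should list every category.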
Chargaff answered Oct 05 '22 23:10