My problem is as follows:
I have a data set containing several factor variables that share the same categories. For each row, I need to find the category that occurs most frequently. In case of ties an arbitrary value can be chosen, although it would be great to have more control over this.
My data set contains over a hundred factors. However, the structure is something like this:
df = data.frame(id = 1:3,
                var1 = c("red","yellow","green"),
                var2 = c("red","yellow","green"),
                var3 = c("yellow","orange","green"),
                var4 = c("orange","green","yellow"))
df
#   id   var1   var2   var3   var4
# 1  1    red    red yellow orange
# 2  2 yellow yellow orange  green
# 3  3  green  green  green yellow
The solution should be a variable within the data frame, for example var5, which contains the most frequent category for each row. It can be a factor or a numeric vector (in case the data first need to be converted to numeric vectors).
In this case, I would like to have this solution:
df$var5
# [1] "red" "yellow" "green"
Any advice will be much appreciated! Thanks in advance!
To find the most frequent value in a factor column of an R data frame, build a frequency table for that column with table, locate the largest count with which.max, and extract the matching category label with names. This can be useful when analysing categorical data and we want to know which level occurs most often.
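As a minimal sketch of the single-column case described above (the colours vector and its values are made up for illustration, not taken from the question):

```r
# Hypothetical example column; the values are invented for illustration.
colours <- factor(c("red", "yellow", "red", "green", "red", "yellow"))

counts <- table(colours)                    # frequency of each level
most_frequent <- names(which.max(counts))   # label of the first maximum

most_frequent   # "red"
```

For a column inside a data frame the same idea would apply to df$some_column in place of colours.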
The mode is the most frequent value. The median is the middle number in an ordered data set. The mean is the sum of all values divided by the total number of values.
In other words, the mode is the value that is repeated the greatest number of times in a series of observations. It is often taken to represent the whole series, on the assumption that most of the other values lie close to it.
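To make the three definitions concrete, here is a small sketch on made-up numbers. Base R has mean and median built in, but its mode function reports the storage mode of an object rather than the statistical mode, so the all_modes helper below is a hypothetical name introduced just for this example:

```r
# Made-up numeric sample, used only for illustration.
x <- c(2, 3, 3, 5, 7, 10)

# Hypothetical helper (not part of base R): returns every value tied for the
# highest count, so a sample with several modes yields more than one value.
all_modes <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}

mean(x)       # 5
median(x)     # 4
all_modes(x)  # 3
```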
Something like this (the id column is dropped with df[-1] so that it is not counted as a category):
apply(df[-1],1,function(x) names(which.max(table(x))))
[1] "red" "yellow" "green"
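In case of a tie, which.max returns the first maximum. Because table lists the categories in sorted order (alphabetical for these values), the tied category that sorts first wins. From the which.max help page: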
Determines the location, i.e., index of the (first) minimum or maximum of a numeric vector.
Example:
# change var4 so that row 1 has a tie between "red" and "yellow"
df$var4 <- c("yellow","green","yellow")
df
  id   var1   var2   var3   var4
1  1    red    red yellow yellow
2  2 yellow yellow orange  green
3  3  green  green  green yellow
apply(df[-1],1,function(x) names(which.max(table(x))))
[1] "red" "yellow" "green"
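If more control over ties is wanted, one possible approach is to pass an explicit preference order. This is only a sketch: row_mode and its priority argument are hypothetical names made up for this example, not part of base R, and the preference order itself is an arbitrary assumption.

```r
# Hypothetical helper: most frequent value in x. When several values are tied,
# the one that appears earliest in `priority` is returned.
row_mode <- function(x, priority = NULL) {
  counts <- table(x)
  winners <- names(counts)[counts == max(counts)]
  if (!is.null(priority)) {
    ranked <- intersect(priority, winners)   # tied winners in preference order
    if (length(ranked) > 0) return(ranked[1])
  }
  winners[1]   # fall back to table order (alphabetical for character data)
}

# Same data as the tie example above.
df <- data.frame(id = 1:3,
                 var1 = c("red", "yellow", "green"),
                 var2 = c("red", "yellow", "green"),
                 var3 = c("yellow", "orange", "green"),
                 var4 = c("yellow", "green", "yellow"))

# Assumed preference: "yellow" beats "red" whenever the two are tied.
df$var5 <- apply(df[-1], 1, row_mode,
                 priority = c("yellow", "red", "green", "orange"))
df$var5   # "yellow" "yellow" "green"
```

Without the priority argument the helper behaves like the which.max version, so row 1 would come out as "red" again.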