I am storing (x, y) values in a data frame. I want to return the most frequently appearing (x, y) combination.
Here is an example:
> x = c(1, 1, 2, 3, 4, 5, 6)
> y = c(1, 1, 5, 6, 9, 10, 12)
> xy = data.frame(x, y)
> xy
  x  y
1 1  1
2 1  1
3 2  5
4 3  6
5 4  9
6 5 10
7 6 12
The most common (x, y) value would be (1, 1).
I tried the answer here for a single column. It works for a single column, but does not work for an aggregate of two columns.
> tail(names(sort(table(xy$x))), 1)
[1] "1"
> tail(names(sort(table(xy$x, xy$y))), 1)
NULL
How do I retrieve the most repeated (x, y) values in two columns in a data frame in R?
EDIT: c(1, 2) should be considered distinct from c(2, 1).
I'm not sure what the desired output should look like, but here's a possible solution:
res <- table(do.call(paste, xy))
res[which.max(res)]
# 1 1
# 2
In order to get the actual values, one could do
res <- do.call(paste, xy)
xy[which.max(ave(seq(res), res, FUN = length)), ]
# x y
# 1 1 1
(Despite all the upvotes, a hybrid of @DavidArenburg's approach and mine
res = do.call("paste", c(xy, sep="\r"))
which.max(tabulate(match(res, res)))
might be simple and effective.)
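As a rough illustration of how the hybrid works on the data from the question, here is a minimal sketch. The names key, counts and idx are only for this example; the "\r" separator is assumed not to occur in the data.

```r
xy <- data.frame(x = c(1, 1, 2, 3, 4, 5, 6),
                 y = c(1, 1, 5, 6, 9, 10, 12))

# Paste each row into a single key; "\r" is assumed not to appear in the data
key <- do.call("paste", c(xy, sep = "\r"))

# match(key, key) maps every row to the index of its first occurrence,
# so tabulate() counts how often each first occurrence is repeated
counts <- tabulate(match(key, key))

idx <- which.max(counts)   # row index of the most frequent combination
xy[idx, ]                  # the (x, y) values themselves
counts[idx]                # how many times that combination appears
```

Because the key keeps the column order, c(1, 2) and c(2, 1) produce different keys and are counted separately.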
Maybe it seems a little roundabout, but a first step is to transform the possibly arbitrary values in the columns of xy to integers ranging from 1 to the number of unique values in the column
x = match(xy[[1]], unique(xy[[1]]))
y = match(xy[[2]], unique(xy[[2]]))
Then encode the combination of columns to unique values
v = x + max(x) * (y - 1L)
Indexing minimizes the range of values under consideration, and encoding reduces a two-dimensional problem to a single dimension. These steps reduce the space required of any tabulation (as with table() in other answers) to the minimum, without creating character vectors.
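To make the index-then-encode idea concrete, here is a small sketch on made-up character data (x_raw and y_raw are hypothetical, not from the question). It assumes the column-major encoding x + max(x) * (y - 1L), which is one collision-free choice, and checks that distinct pairs get distinct codes.

```r
# Hypothetical non-numeric data, to show that indexing works for any type
x_raw <- c("a", "a", "b", "c", "a")
y_raw <- c("u", "u", "v", "u", "v")

# Index: replace each value by the position of its first occurrence
x <- match(x_raw, unique(x_raw))   # 1 1 2 3 1
y <- match(y_raw, unique(y_raw))   # 1 1 2 1 2

# Encode: position of (x, y) in a max(x)-by-max(y) grid, column-major
# (assumed encoding for this sketch)
v <- x + max(x) * (y - 1L)         # 1 1 5 3 4

# Distinct pairs should receive distinct codes
n_pairs <- nrow(unique(data.frame(x, y)))
n_codes <- length(unique(v))
n_pairs == n_codes
```

The codes stay integer throughout, so no character vectors are created along the way.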
If one wanted the most common occurrence in a single dimension, then one could index and tabulate v
tbl = tabulate(match(v, v))
and find the index of the first occurrence of the maximum value(s), e.g.,
xy[which.max(tbl),]
Here's a function to do the magic
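whichpairmax <- function(x, y) {
    ## index each column, then encode each (x, y) pair as a unique integer
    x = match(x, unique(x)); y = match(y, unique(y))
    v = x + max(x) * (y - 1L)
    ## row index of the first occurrence of the most frequent pair
    which.max(tabulate(match(v, v)))
}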
and a couple of tests
> set.seed(123)
> xy[whichpairmax(xy[[1]], xy[[2]]),]
  x y
1 1 1
> xy1 = xy[sample(nrow(xy)),]
> xy1[whichpairmax(xy1[[1]], xy1[[2]]),]
  x y
1 1 1
> xy1
  x  y
3 2  5
5 4  9
7 6 12
4 3  6
6 5 10
1 1  1
2 1  1
For an arbitrary data.frame
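whichdfmax <- function(df) {
    v = integer(nrow(df))
    for (col in df) {
        ## index this column, then fold it into the running encoding
        col = match(col, unique(col))
        v = col + max(col) * (match(v, unique(v)) - 1L)
    }
    ## row index of the first occurrence of the most frequent row
    which.max(tabulate(match(v, v)))
}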
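One way to gain some confidence in the multi-column version is to cross-check it against the paste-based approach on random data. The sketch below is only illustrative: which_row_max is a hypothetical stand-in for the function above, written with the column-major encoding as an assumption, and df3 is made-up data.

```r
# Hypothetical helper mirroring the function above; the encoding
# col + max(col) * (previous - 1L) is this sketch's assumption
which_row_max <- function(df) {
  v <- rep(1L, nrow(df))
  for (col in df) {
    col <- match(col, unique(col))
    v <- col + max(col) * (match(v, unique(v)) - 1L)
  }
  which.max(tabulate(match(v, v)))
}

# Made-up data with three columns of different types
set.seed(1)
df3 <- data.frame(a = sample(3, 50, replace = TRUE),
                  b = sample(letters[1:3], 50, replace = TRUE),
                  c = sample(c(TRUE, FALSE), 50, replace = TRUE))

idx_encoded <- which_row_max(df3)

# Reference result: paste each row into one key and tabulate the keys
key <- do.call("paste", c(df3, sep = "\r"))
idx_pasted <- which.max(tabulate(match(key, key)))

identical(idx_encoded, idx_pasted)
df3[idx_encoded, ]
```

Both approaches count rows by the index of their first occurrence, so they should agree on the row returned, including how ties are broken.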