Retrieve the most repeated (x, y) values in two columns in a data frame

Tags: dataframe, r
I am storing (x, y) values in a dataframe. I want to return the most frequently appearing (x, y) combination.

Here is an example:

> x = c(1, 1, 2, 3, 4, 5, 6)
> y = c(1, 1, 5, 6, 9, 10, 12)
> xy = data.frame(x, y)
> xy
  x  y
1 1  1
2 1  1
3 2  5
4 3  6
5 4  9
6 5 10
7 6 12

The most common (x, y) value would be (1, 1).

I tried the answer here for a single column. It works for a single column, but does not work for an aggregate of two columns.

> tail(names(sort(table(xy$x))), 1)
[1] "1"
> tail(names(sort(table(xy$x, xy$y))), 1)
NULL

How do I retrieve the most repeated (x, y) values in two columns in a data frame in R?

EDIT: c(1, 2) should be considered distinct from c(2, 1).

asked Apr 28 '15 13:04 by user4605941



2 Answers

I'm not sure what the desired output should look like, but here's a possible solution:

res <- table(do.call(paste, xy))
res[which.max(res)]
# 1 1 
#   2 

In order to get the actual values (the first row holding the most frequent pair), one could do

res <- do.call(paste, xy)
xy[which.max(ave(seq_along(res), res, FUN = length)), ]
#   x y
# 1 1 1
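
One caveat with pasting columns together: the default separator is a space, so values that themselves contain spaces can collapse into the same key. Here is a minimal sketch of that, using a made-up data frame `df` (not from the question) and a separator that is assumed not to occur in the data:

```r
# Hypothetical data: two character columns whose values contain spaces
df <- data.frame(a = c("x y", "x", "x"),
                 b = c("z", "y z", "y z"),
                 stringsAsFactors = FALSE)

# Default separator " ": every row becomes "x y z", so distinct pairs are conflated
key_space <- do.call(paste, df)
n_space <- length(unique(key_space))

# A separator assumed to be absent from the data keeps distinct pairs distinct
key_safe <- do.call(paste, c(df, sep = "\r"))
n_safe <- length(unique(key_safe))

c(space = n_space, safe = n_safe)
```

For purely numeric columns like the ones in the question, the default separator should be fine.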
answered Sep 23 '22 11:09 by David Arenburg


(Despite all the plus votes, a hybrid of @DavidArenburg and my approaches

res = do.call("paste", c(xy, sep="\r"))
which.max(tabulate(match(res, res)))

might be simple and effective.)
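
As a rough sketch of how that index could be used, here it is applied to the data from the question. The names `key`, `counts` and `i` are just illustrative:

```r
# Data from the question
x <- c(1, 1, 2, 3, 4, 5, 6)
y <- c(1, 1, 5, 6, 9, 10, 12)
xy <- data.frame(x, y)

# One string key per row; "\r" is assumed not to occur in the data
key <- do.call("paste", c(xy, sep = "\r"))

# match(key, key) maps each row to the first row with the same key,
# so counts[i] is the number of rows equal to row i (0 for later duplicates)
counts <- tabulate(match(key, key))

i <- which.max(counts)  # first row of the most frequent pair
xy[i, ]                 # the pair itself
counts[i]               # how often it occurs
```

Ties are resolved in favour of whichever pair appears first, because `which.max()` returns the first maximum.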

Maybe it seems a little round-about, but a first step is to transform the possibly arbitrary values in the columns of xy to integers ranging from 1 to the number of unique values in the column

x = match(xy[[1]], unique(xy[[1]]))
y = match(xy[[2]], unique(xy[[2]]))

Then encode the combination of columns to unique values

v = x + max(x) * (y - 1L)

Indexing minimizes the range of values under consideration, and encoding reduces a two-dimensional problem to a single dimension. These steps reduce the space required of any tabulation (as with table() in other answers) to the minimum, without creating character vectors.
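
To see that the encoding step can be made collision-free, here is a small check over every combination of three x-levels and two y-levels. This sketch uses the offset form `x + max(x) * (y - 1L)`, which is how a column-major matrix index is computed; `grid` is an illustrative name:

```r
# Every (x, y) pair of index values for 3 x-levels and 2 y-levels
grid <- expand.grid(x = 1:3, y = 1:2)

# Offset encoding: each y-level gets its own block of max(x) codes
v <- grid$x + max(grid$x) * (grid$y - 1L)

cbind(grid, v)      # one distinct code per pair
anyDuplicated(v)    # 0 means no two pairs share a code
```

Note that (1, 2) and (2, 1) receive different codes, so the order of the columns is respected, as the question's EDIT asks.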

If one wanted the most common occurrence in a single dimension, then one could index and tabulate v

tbl = tabulate(match(v, v))

and find the index of the first occurrence of the maximum value(s), e.g.,

xy[which.max(tbl),]

Here's a function to do the magic

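whichpairmax <- function(x, y) {
    # index each column: 1..(number of unique values)
    x = match(x, unique(x)); y = match(y, unique(y))
    # encode each (x, y) pair as a single integer, unique per pair
    v = x + max(x) * (y - 1L)
    # row index of the first occurrence of the most frequent pair
    which.max(tabulate(match(v, v)))
}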

and a couple of tests

> set.seed(123)
> xy[whichpairmax(xy[[1]], xy[[2]]),]
  x y
1 1 1
> xy1 = xy[sample(nrow(xy)),]
> xy1[whichpairmax(xy1[[1]], xy1[[2]]),]
  x y
1 1 1
> xy1
  x  y
3 2  5
5 4  9
7 6 12
4 3  6
6 5 10
1 1  1
2 1  1

For an arbitrary data.frame

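whichdfmax <- function(df) {
    v = integer(nrow(df))
    for (col in df) {
        # index the column, then fold it into the running row code
        col = match(col, unique(col))
        v = col + max(col) * (match(v, unique(v)) - 1L)
    }
    # row index of the first occurrence of the most frequent row
    which.max(tabulate(match(v, v)))
}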
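
For what it's worth, here is how the same idea might look on a three-column data frame with mixed column types. This is only a sketch: `which_row_max` and `df3` are made-up names, and the encoding uses the offset form `col + max(col) * (v - 1L)`:

```r
# Hypothetical helper: index of the first occurrence of the most frequent row
which_row_max <- function(df) {
    v <- rep(1L, nrow(df))
    for (col in df) {
        col <- match(col, unique(col))   # index this column
        v <- match(v, unique(v))         # re-index the running code to keep it small
        v <- col + max(col) * (v - 1L)   # combine into one code per distinct row
    }
    which.max(tabulate(match(v, v)))
}

# Made-up test data: the row ("v", 2, FALSE) occurs three times
df3 <- data.frame(
    a = c("u", "v", "u", "v", "u", "v"),
    b = c(1, 2, 1, 2, 1, 2),
    c = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
)

i <- which_row_max(df3)
df3[i, ]
```

Re-indexing `v` on every pass keeps the codes bounded by the number of distinct rows seen so far times the number of levels in the current column, so integer overflow is unlikely for moderately sized data, though it is not ruled out for very large inputs.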
answered Sep 22 '22 11:09 by Martin Morgan