
R data.table finding the mode for a group of data

Tags: r, data.table

I have the following data table x

id1 id2
a  x
a  x
a  y
b  z

For each combination of id1 and id2, I can count the number of instances like this:

x[,list(
    freq = .N
   ),by = "id1,id2"]

The above would yield

a x 2
a y 1
b z 1

Next, I want to find the most frequent id2 for each id1, i.e. the mode. The expected result is:

 a x 2
 b z 1

I can get there in a roundabout way, but is there a way to put a sequence number at the id1 level? Or some other hack that gets me there quickly and efficiently, perhaps in the first step shown above? Thanks in advance.

asked Aug 14 '13 22:08 by broccoli



1 Answer

I'd do it this way:

setkey(x[, list(freq = .N), by=list(id1, id2)],
         id1, freq)[J(unique(id1)), mult="last"]
   id1 id2 freq
1:   a   x    2
2:   b   z    1

The idea is to first get the freq column (as you did), then call setkey on the resulting data.table with the columns id1 and freq. Setting the key sorts the rows by id1 and, within each id1, by freq in ascending order. With that in place, we can do a by-without-by subset on the unique id1 values and combine it with mult="last": because each group is sorted in ascending order, its last row is the one with the largest freq.
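The same steps can be written out one at a time, which may make the role of the key easier to see. This is only a sketch under a few assumptions: the data.table package is installed, the table is named x as in the question, and the intermediate names counts and modes are introduced here purely for illustration.

```r
library(data.table)

# Sample data from the question
x <- data.table(id1 = c("a", "a", "a", "b"),
                id2 = c("x", "x", "y", "z"))

# Step 1: count each (id1, id2) combination
counts <- x[, list(freq = .N), by = list(id1, id2)]

# Step 2: key by id1, then freq; within each id1 the rows are now
# sorted by freq in ascending order
setkey(counts, id1, freq)

# Step 3: join on each distinct id1 and keep only the last matching row,
# i.e. the row with the largest freq for that id1
modes <- counts[J(unique(id1)), mult = "last"]
print(modes)
```

Splitting the one-liner this way does not change what it computes; it just keeps the keyed counts table around in case it is needed again.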

This saves a separate sort step for each group, which can get time-consuming as the number of groups grows. Note that this does not handle ties: if two id2 values share the maximum freq for the same id1, only one of them is returned.
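If ties matter, one possible alternative (not the keyed approach above) is to filter each id1 group down to the rows whose freq equals the group maximum. The sketch below assumes the data.table package is installed; the table x_tied and its data are hypothetical and made up only to show a tie.

```r
library(data.table)

# Hypothetical data with a tie: for id1 == "a", both "x" and "y" occur twice
x_tied <- data.table(id1 = c("a", "a", "a", "a", "b"),
                     id2 = c("x", "x", "y", "y", "z"))

# Count each (id1, id2) combination, as in the question
counts <- x_tied[, list(freq = .N), by = list(id1, id2)]

# Within each id1, keep every row whose freq equals the group maximum,
# so tied modes are all retained
all_modes <- counts[, .SD[freq == max(freq)], by = id1]
print(all_modes)
```

This does per-group work, so it is likely to be slower than the keyed subset when there are many groups; it trades some speed for returning every tied mode.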

answered Nov 07 '22 02:11 by Arun