R data.table finding the mode for a group of data

Tags: r, data.table

I have the following data table x:

id1 id2
a  x
a  x
a  y
b  z

For each combination of id1, id2 I can find the number of instances in the following way:

x[, list(freq = .N), by = "id1,id2"]

The above would yield:

a x 2
a y 1
b z 1

Next I want to find the most frequent id2 for each id1, i.e. the mode. So the expected result is:

 a x 2
 b z 1

I can get there in a roundabout way, but is there a way to put a sequence number at the id1 level? Or some such hack that gets me there efficiently and quickly, perhaps at the first step shown above? Thanks in advance.

asked Aug 14 '13 22:08

broccoli

1 Answer

I'd do it this way:

setkey(x[, list(freq = .N), by=list(id1, id2)],
       id1, freq)[J(unique(id1)), mult="last"]

   id1 id2 freq
1:   a   x    2
2:   b   z    1

The idea is to first compute the freq column (as you did), then call setkey on the resulting data.table with the columns id1 and freq. setkey sorts the table by those columns in ascending order, so within each id1 the rows end up ordered by increasing freq. We can then do a by-without-by subset on the unique values of id1 and combine it with mult="last": for every group, the last row is the one with the largest freq.
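The same steps can also be written out one at a time, which makes the sort order easier to inspect. This is only a sketch that rebuilds the question's sample table x; the intermediate names counts and modes are made up for illustration.

```r
library(data.table)

# Sample data from the question
x <- data.table(id1 = c("a", "a", "a", "b"),
                id2 = c("x", "x", "y", "z"))

# Step 1: count each (id1, id2) pair
counts <- x[, list(freq = .N), by = list(id1, id2)]

# Step 2: key on id1, then freq; setkey sorts ascending, by reference
setkey(counts, id1, freq)

# Step 3: for each id1, take the last row of its group (the highest freq)
modes <- counts[J(unique(id1)), mult = "last"]
print(modes)
```

Printing counts after step 2 shows the row a/y/1 ahead of a/x/2, which is why mult="last" picks the mode.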

This avoids a separate sort for each group, which can get time-consuming as the number of groups grows. Note that this does not handle ties: if two id2 values share the maximum freq for the same id1, only one of them is returned.
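If ties matter, one possible alternative is to keep every row whose freq equals the group maximum. This is not the keyed approach from this answer, and it is likely slower with many groups because it filters inside each group. A minimal sketch, using hypothetical sample data that contains a tie and a made-up result name modes_with_ties:

```r
library(data.table)

# Hypothetical data with a tie: for id1 == "a", both "x" and "y" occur twice
x <- data.table(id1 = c("a", "a", "a", "a", "b"),
                id2 = c("x", "x", "y", "y", "z"))

# Count each (id1, id2) pair, as in the question
counts <- x[, list(freq = .N), by = list(id1, id2)]

# Within each id1, keep all rows whose freq equals the group maximum,
# so tied modes are all returned
modes_with_ties <- counts[, .SD[freq == max(freq)], by = id1]
print(modes_with_ties)
```

Here id1 "a" yields two rows (x and y), while "b" yields one.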

answered Nov 07 '22 02:11

Arun
