Using .I to return row numbers with data.table package

Tags: r, data.table

Would someone please explain to me the correct usage of `.I` for returning the row numbers of a data.table?

I have data like this:

    require(data.table)
    DT <- data.table(X = c(5, 15, 20, 25, 30))
    DT
    #     X
    # 1:  5
    # 2: 15
    # 3: 20
    # 4: 25
    # 5: 30

I want to return a vector of row indices where a condition in `i` is `TRUE`, e.g. which rows have an `X` greater than 20.

    DT[X > 20]
    # rows 4 & 5 are greater than 20

To get the indices, I tried:

    DT[X > 20, .I]
    # [1] 1 2

...but clearly I am doing it wrong, because that simply returns a vector containing 1 to the number of returned rows. (Which I thought was pretty much what `.N` was for?)

Sorry if this seems extremely basic, but all I have been able to find in the data.table documentation is WHAT `.I` and `.N` do, not HOW to use them.

213

asked Mar 14 '14 14:03

user3351605

2 Answers

If all you want is the row numbers rather than the rows themselves, then use `which = TRUE`, not `.I`.

    DT[X > 20, which = TRUE]
    # [1] 4 5

That way you get the benefits of optimization of `i`, for example fast joins or use of an automatic index. With `which = TRUE`, the call returns early with just the row numbers.

Here's the manual entry for the `which` argument inside data.table:

> `TRUE` returns the row numbers of `x` that `i` matches to. If `NA`, returns the row numbers of `i` that have no match in `x`. By default `FALSE` and the rows in `x` that match are returned.
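To make the manual entry concrete, here is a minimal sketch of `which = TRUE` and `which = NA`, first with a plain filter and then with a join. The tables `x` and `i` and their columns are invented for illustration and are not part of the question; `nomatch = NULL` assumes a reasonably recent data.table (older versions spell it `nomatch = 0L`).

```r
library(data.table)

# Hypothetical tables, invented for this illustration
x <- data.table(id = c("a", "b", "c"), val = 1:3)
i <- data.table(id = c("b", "d"))

# Plain filter: row numbers of x where the condition holds
rows_filter <- x[val > 1, which = TRUE]

# Join: row numbers of x that the rows of i match to
# (nomatch = NULL drops the rows of i that have no match)
rows_join <- x[i, on = "id", nomatch = NULL, which = TRUE]

# which = NA: row numbers of i that have no match in x
rows_unmatched <- x[i, on = "id", which = NA]

print(rows_filter)     # 2 3
print(rows_join)       # 2
print(rows_unmatched)  # 2 ("d" is row 2 of i)
```

Note that `rows_join` indexes into `x` while `rows_unmatched` indexes into `i`, so the two results refer to different tables even though both happen to be 2 here.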

Explanation:

Notice there is a specific relationship between `.I` and the `i = ..` argument in `DT[i = .., j = .., by = ..]`. Namely, `.I` is a vector of row numbers of the subsetted table, so when `i` filters rows, `.I` simply counts from 1 to the number of rows that remain.

    ### Let's create some sample data
    set.seed(1)
    LL <- sample(LETTERS[1:5], 20, TRUE)
    DT <- data.table(X = LL)

Look at the difference between subsetting the whole table, and subsetting just `.I`:

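    # Output shown is from R < 3.6.0; the default sampling algorithm
    # changed in R 3.6.0, so newer versions draw different letters.
    DT[X == "B", .I]
    # [1] 1 2 3 4 5 6

    DT[ , .I[X == "B"] ]
    # [1]  1  2  5 11 14 19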
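The same contrast can be reproduced without relying on the random seed. The sketch below uses a small hand-written column (the name `DT2` and its values are made up for illustration), so the row numbers are the same on any R version.

```r
library(data.table)

# Hypothetical fixed data so the result does not depend on set.seed()
DT2 <- data.table(X = c("B", "B", "C", "E", "B", "E"))

# i filters first, so .I just numbers the 3 surviving rows
in_subset <- DT2[X == "B", .I]

# No i: .I covers every row of DT2 and is then subset inside j
in_table <- DT2[, .I[X == "B"]]

print(in_subset)  # 1 2 3
print(in_table)   # 1 2 5
```

Only the second form gives positions in the original table, which is what the question asked for.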

184

answered Oct 07 '22 21:10

Ricardo Saporta

> Sorry if this seems extremely basic, but all I have been able to find in the data.table documentation is WHAT .I and .N do, not HOW to use them.

First let's check the documentation. I typed `?data.table` and searched for `.I`. Here's what's there:

> Advanced: **When grouping**, symbols `.SD`, `.BY`, `.N`, `.I` and `.GRP` may be used in the `j` expression, defined as follows.
>
> `.I` is an integer vector equal to `seq_len(nrow(x))`. **While grouping**, it holds for each item in the group its row location in `x`. This is useful to subset in `j`; e.g. `DT[, .I[which.max(somecol)], by=grp]`.
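As a rough illustration of the documented idiom `DT[, .I[which.max(somecol)], by=grp]`, the sketch below finds, per group, the row number of the largest value and then uses those numbers to pull the full rows. The table `sales` and its values are hypothetical; only the column names `grp` and `somecol` follow the documentation's example.

```r
library(data.table)

# Hypothetical data; column names follow the documentation's example
sales <- data.table(
  grp     = c("a", "a", "b", "b", "b"),
  somecol = c(3, 7, 2, 9, 4)
)

# For each group, the row number (in the full table) of the largest somecol
idx <- sales[, .I[which.max(somecol)], by = grp]$V1
print(idx)  # 2 4

# Use those row numbers to fetch the complete rows
top_rows <- sales[idx]
print(top_rows)
```

Because `.I` holds positions in the whole table while grouping, `idx` can be used directly in `i` of a second call.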

Emphasis added by me here. The original intention was for `.I` to be used while grouping. Note that the documentation does in fact contain an example of HOW to use `.I`.

You aren't grouping.

That said, what you tried was reasonable. Over time these symbols have come to be used when not grouping as well. There might be a case that `.I` should return what you expected. I can see that using `.I` in `j` together with both `i` and `by` could be useful. Currently `.I` doesn't seem helpful when `i` is present, as you pointed out.

Using the `which()` function is good but might then circumvent optimization in `i` (`which()` needs a long logical input which has to be created and passed to it). Using the `which=TRUE` argument is good but then just returns the row numbers (you couldn't then do something with those row numbers in `j` by group).
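The trade-offs above can be seen side by side in a small sketch. The table `DT3` and its `grp` column are assumptions added for illustration; the question's data has only `X`. All three forms return the same row numbers here, but only the `.I` form can also be grouped.

```r
library(data.table)

# Hypothetical data: the question's idea plus an invented grouping column
DT3 <- data.table(
  grp = c("a", "a", "b", "b", "b"),
  X   = c(5, 25, 20, 25, 30)
)

via_which_fun <- DT3[, which(X > 20)]       # base which() on a logical vector in j
via_which_arg <- DT3[X > 20, which = TRUE]  # condition stays in i
via_dot_I     <- DT3[, .I[X > 20]]          # .I subset inside j

print(via_which_fun)  # 2 4 5
print(via_which_arg)  # 2 4 5
print(via_dot_I)      # 2 4 5

# Only the .I form can be combined with by to keep row numbers per group
per_group <- DT3[, .(row = .I[X > 20]), by = grp]
print(per_group)
```

This sketch does not measure speed; which form is fastest on large tables depends on whether `i` can use a key or index.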

Feature request #1494 has been filed to discuss changing `.I` to work the way you expected. The documentation does contain the words "its row location in x", which would imply what you expected, since `x` is the whole data.table.

answered Oct 07 '22 21:10

Matt Dowle
