Accessing a data.table via column numbers and grep

Question

If I use this simple data.table (one column)

mydata <- data.table(A=c("ID123", "ID22", "AAA", NA))

I can find the positions of the rows starting with "ID"

grep("^ID", mydata[,A])

How can I get the same result using a column number instead (say, the first column)?

I've tried

grep("^ID", mydata[,1, with=F])

but it doesn't work.

More importantly, I would like to do it the data.table way, with the command inside the brackets.

mydata[,grep("^ID",.SD), .SDcols=1]

But this doesn't work.

I've found this way, but it's too convoluted

mydata[,lapply(.SD, grep,pattern="ID"), .SDcols=1]

What's the proper way to do it?

A little bit more complex:
What if I want to count, in one step, how many rows are not NA and start with "ID"?

Something like

any(!(grepl("^ID", mydata[,A] ) | is.na(mydata[,A])))

but more compact and inside the brackets.

I don't like the fact that grep treats NA as a non-match instead of outputting an NA too.

Matt Dowle · Accepted Answer

Don't forget that a data.table is a list, too. So if all you really want is an entire column as a vector, you are encouraged to just use base R methods on it: [[ and $.

library(data.table)
mydata <- data.table(A=c("ID123", "ID22", "AAA"))
mydata
#       A
#1: ID123
#2:  ID22
#3:   AAA
grep("^ID", mydata[[1]])   # using a column number
#[1] 1 2
grep("^ID", mydata$A)
#[1] 1 2

If you need this in a loop then [[ and $ are faster as they avoid the overhead of argument checking inside DT[...]. If it's just one call then that overhead is negligible.
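The overhead mentioned above can be observed with a rough timing loop. This is only a sketch: the iteration count `n` is an arbitrary choice made for illustration, and the absolute timings will vary by machine and data.table version, so no expected timings are shown.

```r
library(data.table)

mydata <- data.table(A = c("ID123", "ID22", "AAA", NA))
n <- 1000L  # arbitrary number of repetitions, chosen only for illustration

# Extract the column n times through DT[...] (argument checking on each call)
t_bracket <- system.time(for (i in seq_len(n)) mydata[, A])

# Extract the column n times with base R list access
t_list <- system.time(for (i in seq_len(n)) mydata[[1]])

# Both routes return the same character vector
same_result <- identical(mydata[, A], mydata[[1]])

print(rbind(bracket = t_bracket, list = t_list))
```

For a single call the difference should not matter; the comparison is only relevant when the extraction sits inside a tight loop.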

grep("^ID", mydata[,1, with=F]) "doesn't work" (please include the error message that you saw instead of "doesn't work"!) because grep wants a vector but DT[] always returns a data.table, even if it has only one column, for important type consistency, e.g. when chaining. Using mydata[[1]] directly is cleaner, but another way, just to illustrate, is grep("^ID", mydata[,1,with=F][[1]]).
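The difference between the two return types can be checked directly. A minimal sketch, using the question's four-row table (including the NA); the names `sub_dt` and `col_vec` are introduced here only for illustration:

```r
library(data.table)

mydata <- data.table(A = c("ID123", "ID22", "AAA", NA))

sub_dt  <- mydata[, 1, with = FALSE]  # still a data.table, with one column
col_vec <- mydata[[1]]                # a plain character vector

class(sub_dt)          # includes "data.table"
is.character(col_vec)  # TRUE

# grep works on the vector, or on the vector pulled out of the one-column table
pos_direct  <- grep("^ID", col_vec)
pos_chained <- grep("^ID", sub_dt[[1]])
```

Both `pos_direct` and `pos_chained` hold the positions 1 and 2; the NA in row 4 is simply not matched.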

As Frank said in the comments, using column numbers is highly discouraged: as the documentation explains, it invites bugs as your data changes over the months and years into the future. Use column names instead, within DT[...].
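With a column name, the search fits inside the brackets without any extraction step. A short sketch on the question's data; `%like%` is data.table's own pattern-matching operator, and the result names are introduced here only for illustration:

```r
library(data.table)

mydata <- data.table(A = c("ID123", "ID22", "AAA", NA))

# Positions of the matching rows, computed in j
pos <- mydata[, grep("^ID", A)]

# The matching rows themselves, selected in i
rows_grepl <- mydata[grepl("^ID", A)]

# The same subset written with %like%
rows_like <- mydata[A %like% "^ID"]
```

Here `pos` is 1 2, and both subsets contain the rows "ID123" and "ID22".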

But if you really must, and sometimes it is valid, then how about:

..theCol = DT[[theNumber]]
DT[ grep(,..theCol) & ..theCol | ..theCol etc , ... ]

The .. prefix in your variable name kind of means "one up" like a directory path. But any variable name that for sure isn't a column name would do. This way you can use it several times inside DT[...] without having to repeat both the table name DT and the column number just to access the column by number several times. (We try to avoid symbol name repetition as much as possible to reduce the potential for bugs due to typos.)
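A concrete version of this idea, applied to the second part of the question (counting rows that are not NA and start with "ID"), might look like the sketch below. It is an illustration, not the answer's exact code: `col_number` and `the_col` are hypothetical names, and a plain variable name is used instead of the `..` prefix, relying on the point above that any name that is not a column name will do.

```r
library(data.table)

mydata <- data.table(A = c("ID123", "ID22", "AAA", NA))

col_number <- 1L                 # hypothetical: position of the column of interest
the_col <- mydata[[col_number]]  # extracted once, reused several times below

# Count the rows that are not NA and start with "ID"
n_id <- mydata[grepl("^ID", the_col) & !is.na(the_col), .N]

# A match vector that keeps NA as NA instead of turning it into FALSE
starts_id <- ifelse(is.na(the_col), NA, grepl("^ID", the_col))
```

With this data `n_id` is 2, and `starts_id` is TRUE, TRUE, FALSE, NA, which addresses the complaint that grep and grepl silently treat NA as a non-match.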

IRTFM · Answer

One data.table way of indexing a column by number would be to convert the number to a column name, convert that to an R symbol, and evaluate it:

mydata[ , eval( as.symbol( names(mydata)[1] ) )]
#[1] "ID123" "ID22"  "AAA"
grep("^ID", mydata[,eval(as.symbol(names(mydata)[1]))])
#[1] 1 2

But this is not really an approved path to success because of the DT FAQ #1 as well as the fact that row numbers are not considered as valid targets. The philosophy (as I understand it) is that row numbers are accidental and you should be storing your records with unique identifiers.
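Two other ways to reach a column by position inside the brackets are sketched below as alternatives to eval(as.symbol(...)). The first also suggests why the question's .SDcols attempt failed: .SD is itself a data.table rather than a vector, so the column still has to be pulled out with [[. The result names are introduced here only for illustration.

```r
library(data.table)

mydata <- data.table(A = c("ID123", "ID22", "AAA", NA))

# .SD restricted to the first column, then reduced to a vector with [[1]]
pos_sd <- mydata[, grep("^ID", .SD[[1]]), .SDcols = 1]

# Look the column up by the name found at position 1
pos_get <- mydata[, grep("^ID", get(names(mydata)[1]))]
```

Both calls return the positions 1 and 2. The caveat about column numbers being fragile applies to these forms as well.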

Tags: r, data.table

skan
