Referring to data.table columns by names saved in variables

Tags: r, data.table

data.table is a fantastic R package and I am using it in a library I am developing. So far all is going very well, except for one complication. Referring to data.table columns using names saved in variables seems much more difficult than with conventional data frames, where one could write, for example: colname="col"; df[df[,colname]<5,colname]=0.

What complicates things most is perhaps the apparent lack of consistent syntax for this in data.table. In some cases eval(colname), get(colname), or even c(colname) seem to work. In others, DT[,colname, with=F] is the solution. In yet others, such as the set() and subset() functions, I haven't found a solution at all. Finally, an extreme but quite common use case was discussed earlier (passing column names to data.table programmatically), and the proposed solutions, while apparently doing their job, did not seem particularly readable...

Perhaps I am complicating things too much? If anyone could jot down a quick cheatsheet for referring to data.table column names using variables for different common scenarios, I would be very grateful.

UPDATE:

Some specific examples that work provided I can hard code column names:

    x.short = subset(x, abs(dist)<=100)
    set(x, which(x$val<10), "val", 0)

Now assume distcol="dist", valcol="val". What is the best way to do the above using distcol and valcol, but not dist and val?

asked May 17 '13 20:05

msp

1 Answer

If you are going to be doing complicated operations inside your j expressions, you should probably use eval and quote. One problem with that in the current version of data.table is that the environment of eval is not always processed correctly (see "eval and quote in data.table"; note that that answer has since been updated following an update to the package), and the current fix is to add .SD to eval. As far as I can tell from a few tests I've run, this doesn't affect speed (the way, e.g., having .SD[1] in j would).

Interestingly this issue only plagues the j and you'll be fine using eval normally in i (where .SD is not available anyway).
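A minimal sketch of eval in i, using a toy table and a quoted column name that are made up here for illustration. It assumes the data.table package is installed and only shows the row filter, with no .SD involved:

```r
library(data.table)

# Toy data, invented for this illustration
x <- data.table(dist = 1:10, val = 1:10)

# The column is stored as a quoted symbol rather than typed directly
valcol <- quote(val)

# eval() in i resolves the symbol against the columns of x
small <- x[eval(valcol) < 5]
print(small)
```

The filter keeps the rows where val is below 5, the same rows that x[val < 5] would return with the name hard-coded.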

The other problem is assignment, and there you have to have strings. I know one way to extract the string name from a quoted expression - it's not pretty, but it works. Here's an example combining everything together:

    library(data.table)

    x = data.table(dist = c(1:10), val = c(1:10))
    distcol = quote(dist)
    valcol = quote(val)

    x[eval(valcol) < 5,
      capture.output(str(distcol, give.head = F)) := eval(distcol)*sum(eval(distcol, .SD))]

Note that I got away without adding .SD to one eval(distcol), but it would not work if I took it out of the other eval (the one inside sum).
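As a possibly tidier way to get the string name out of a quoted expression, base R's as.character() (or deparse()) can be applied to it. The sketch below assumes the quoted expression is a single bare symbol, uses a made-up toy table, and deliberately keeps the right-hand side of := trivial so that it only illustrates the name extraction:

```r
library(data.table)

# Toy data, invented for this illustration
x <- data.table(dist = 1:10, val = 1:10)

distcol <- quote(dist)

# For a bare symbol, as.character() returns its name as a string
diststr <- as.character(distcol)

# Wrapping the variable in parentheses tells := to use its value as the column name
x[val < 5, (diststr) := 0L]
print(x)
```

This is a swapped-in technique, not the approach used in the answer, and it would need adjusting if the quoted expression were a call rather than a symbol.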

Another option is to use get:

    diststr = "dist"
    valstr = "val"

    x[get(valstr) < 5, c(diststr) := get(diststr)*sum(get(diststr))]
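Applied to the two operations from the question, the get approach might look like the sketch below. The table contents are invented for illustration, and replacing subset() with a row filter in i is my own substitution rather than something the question asked for. set() already takes the column name as a string, so the variable can be passed to it directly:

```r
library(data.table)

# Toy data, invented for this illustration
x <- data.table(dist = c(-150L, 50L, 200L, 80L),
                val  = c(5L, 20L, 8L, 30L))

# Column names held in variables, as in the question
distcol <- "dist"
valcol  <- "val"

# Row filter in i; get() looks the column up by its name
x.short <- x[abs(get(distcol)) <= 100]

# x[[valcol]] extracts the column as a plain vector;
# set() receives the column name as a string
set(x, which(x[[valcol]] < 10), valcol, 0L)

print(x.short)
print(x)
```

Because the filter returns a new table, x.short is unaffected by the later set() call, which modifies x by reference.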

answered Sep 20 '22 11:09

eddi
