I often need to select a set of variables from a data.frame in R. My research is in the social and behavioural sciences, and it is quite common to have a data.frame with several hundreds of variables (e.g., there'll be item level information for a range of survey questions, demographic items, performance measures, etc., etc.).
As part of analyses, I'll often want to select a subset of variables. For example, I might want to get:
Now, I know that there are many ways to write the code to select a subset of variables. Quick-r has a nice overview of common ways of extracting variable subsets from a data.frame.
e.g.,
    myvars <- c("v1", "v2", "v3")
    newdata <- mydata[myvars]
However, I'm interested in the efficiency of this process, particularly where you might need to extract 20 or so variables from a data.frame. The naming convention of variables is often not intuitive, especially where you've inherited a dataset from someone else, so you might be left wondering, was the variable Gender, gender, sex, GENDER, gender1, etc. Multiply this by 20 variables that need to be extracted, and the task of memorising variable names becomes more complicated than it needs to be.
To make the following discussion concrete, I'll use the bfi data.frame in the psych package.
    library(psych)
    data(bfi)
    df <- bfi
    head(df, 1)
          A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O2 O3 O4
    61617  2  4  3  4  4  2  3  3  4  4  3  3  3  4  4  3  4  2  2  3  3  6  3  4
          O5 gender education age
    61617  3      1        NA  16
How would I select the following variables: A1, A2, A3, A5, C2, C3, C5, E2, E3, gender, education, age?
I currently have a range of strategies that I use. Of course, sometimes I can exploit things like the numeric position of the variables or the naming convention and use either grep to select or paste to construct. But sometimes I need a more general solution. I've used the following over time:
In the early days, I used to call names(df), copy the quoted variable names and then edit until I have what I want.
Sometimes I'll have a separate data.frame that stores each variable as a row, and has columns for variable names, variable labels, and it has a column which indicates whether the variable should be retained for a particular analysis. I can then filter on that include variable and extract a vector of variable names. I find this particularly useful when I'm developing a psychological test and for various iterations I want to include or exclude certain items.
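A minimal sketch of that lookup-table strategy might look like the following. The small df here is a stand-in with a few bfi-style column names and made-up values, and the meta table, its labels and its include flags are all hypothetical; in practice the table might live in a spreadsheet or CSV file.

```r
# Stand-in data: a few columns named like those in psych::bfi (values made up)
df <- data.frame(A1 = c(2, 3), A2 = c(4, 4), C1 = c(2, 5),
                 gender = c(1, 2), age = c(16, 18))

# Hypothetical lookup table: one row per variable in df
meta <- data.frame(
  variable = names(df),
  label    = c("Item A1", "Item A2", "Item C1", "Gender", "Age"),
  include  = c(TRUE, FALSE, TRUE, TRUE, TRUE),  # keep for this analysis?
  stringsAsFactors = FALSE
)

# Filter on the include column to get a vector of variable names
myvars <- meta$variable[meta$include]

# Use that vector to subset the data
newdata <- df[myvars]
names(newdata)
```

Flipping a single TRUE/FALSE in the include column is then enough to add or drop an item for the next iteration.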
As Hadley Wickham once pointed out to me, dput is a good option; e.g., dput(names(df)) is better than names(df) in that it outputs a list that is already in the c("var1", "var2", ...) format:
    dput(names(df))
    c("A1", "A2", "A3", "A4", "A5", "C1", "C2", "C3", "C4", "C5", "E1",
    "E2", "E3", "E4", "E5", "N1", "N2", "N3", "N4", "N5", "O1", "O2",
    "O3", "O4", "O5", "gender", "education", "age")
This can then be copied into the script and edited.
I guess dput is a pretty good variable selection strategy. The efficiency of the process largely depends on how proficient you are in copying the text into your script and then editing the list of names down to those desired.
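For the cases mentioned earlier where the naming convention can be exploited, a rough sketch of the grep and paste route is below. It builds a one-row stand-in with the same column names as bfi so that it runs without the psych package; the stand-in and the names agree_items and ce_items are assumptions made purely for illustration.

```r
# Stand-in for psych::bfi: one row of NAs with the same 28 column names
bfi_names <- c(paste0(rep(c("A", "C", "E", "N", "O"), each = 5), 1:5),
               "gender", "education", "age")
df <- as.data.frame(matrix(NA_integer_, nrow = 1, ncol = length(bfi_names),
                           dimnames = list(NULL, bfi_names)))

# paste0() constructs names that follow a pattern
agree_items <- paste0("A", 1:5)   # "A1" "A2" "A3" "A4" "A5"

# grep() selects names matching a regular expression;
# value = TRUE returns the names themselves rather than their positions
ce_items <- grep("^[CE][0-9]$", names(df), value = TRUE)

# Combine constructed, matched and hand-typed names into one selection
newdata <- df[c(agree_items, ce_items, "gender", "education", "age")]
names(newdata)
```

This only helps when the names are regular; for an irregular selection such as A1, A2, A3, A5 the names still have to be listed or edited by hand.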
However, I still remember the efficiency of GUI based systems of variable selection. For example, in SPSS when you interact with a dialogue box you can point and click with the mouse the variables you want from the dataset. You can shift-click to select a range of variables, you can hold shift and press the down key to select one or more variables, and so on. And then you can press Paste and the command with extracted variable names is pasted into your script editor.
Is there a simple no-frills GUI device for selecting variables from a data.frame in R (e.g., something like guiselect(df) opens a GUI window for variable selection) that returns a vector of the selected variable names, c("var1", "var2", ...)?
Is dput the best general option for selecting a set of variable names in R? Or is there a better way?
Update (April 2017): I have posted my own understanding of a good strategy below.
I'm personally a fan of the myvars <- c(...) approach and then using mydf[,myvars] from there on in.
However this still requires you to enter the initial variable names (even though just once), and as far as I read your question, it is this initial 'picking variable names' that is what you're asking about.
Re a simple no-frills GUI device -- I've recently been introduced to the menu function, which is exactly a simple no-frills GUI device for selecting one object out of a list of choices. Try menu(names(df),graphics=TRUE) to see what I mean (returns the column number). It even gives a nice text interface if for some reason your system can't do the graphics (try with graphics=FALSE to see what I mean).
However, this is of limited use to you, as you can only select one column name. To select multiple, you can use select.list (mentioned in ?menu as the alternative to make multiple selections):
    # example with iris data (I don't have 'psych' package):
    vars <- select.list(names(iris), multiple=TRUE,
                        title='select your variable names',
                        graphics=TRUE)
This also takes a graphics=TRUE option (single click on all the items you want to select). It returns the names of the variables.
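To get closer to the SPSS-style "Paste" workflow described in the question, select.list could be combined with dput so that the selection is printed as code ready to copy into a script. The wrapper below is only a sketch: pick_vars is a hypothetical name, not an existing R function, and because select.list cannot be used non-interactively the example call is guarded with interactive().

```r
# Hypothetical helper (not part of base R): choose columns interactively,
# then print the selection in c("...") form, ready to paste into a script.
pick_vars <- function(data, graphics = FALSE) {
  chosen <- utils::select.list(names(data), multiple = TRUE,
                               title = "Select variables",
                               graphics = graphics)
  dput(chosen)        # prints the selection as R code
  invisible(chosen)   # also return it for immediate use
}

# select.list() needs an interactive session, so only call it in one
if (interactive()) {
  vars <- pick_vars(iris)
  head(iris[vars])
}
```

Pasting the printed c(...) line back into the script keeps the analysis reproducible, since the script no longer depends on someone clicking through the dialog on each run.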