I have a data frame with these dummy values and I want to run an lm regression on them. One of the variables is a grouped continuous variable, as shown below:
df <- data.frame("y" = c(10, 11, 12, 13, 14),
"x" = as.factor(c("100-102", "103-105", "106-108", "109-111", "112-114")))
I want to regress y ~ x. One way is to replace the x factor levels with their mean (midpoint) numeric values. This is easily done using a regular expression.
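A minimal sketch of that midpoint approach, assuming every label has the form "lower-upper" with integer bounds. The column name x_mid is only illustrative.

```r
# Same example data as in the question.
df <- data.frame(y = c(10, 11, 12, 13, 14),
                 x = as.factor(c("100-102", "103-105", "106-108", "109-111", "112-114")))

# Extract the lower and upper bound of each interval label.
lower <- as.numeric(sub("-.*$", "", as.character(df$x)))
upper <- as.numeric(sub("^.*-", "", as.character(df$x)))

# Use the interval midpoint as the numeric predictor (x_mid is a made-up name).
df$x_mid <- (lower + upper) / 2

fit <- lm(y ~ x_mid, data = df)
coef(fit)
```

This keeps one row per group, so the regression treats each interval as a single observation located at its midpoint.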
Another way is to create additional rows and expand the dataset so it looks like this:
data.frame("y" = c(10, 10, 10, 11, 11, 11......),
"x" = c(100, 101, 102, 103, 104, 105......))
Is there a function that will do this?
I'm thinking of first creating additional variables like x1, x2, x3 and then using the reshape2 package to convert the x columns to rows.
You can use the cut() function in R to create a categorical variable from a continuous one. Note that breaks specifies the values to split the continuous variable on and labels specifies the label to give to the values of the new categorical variable.
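As a rough illustration of breaks and labels, here is a small sketch that goes in the opposite direction of the question: it turns raw numeric values into the grouped labels. The vector x_raw is made up for the example, and right = FALSE is a choice so that each interval includes its lower bound.

```r
# Hypothetical raw values, chosen to fall inside the ranges from the question.
x_raw <- c(100, 101, 104, 107, 110, 114)

# right = FALSE makes each interval closed on the left: [100, 103), [103, 106), ...
x_grouped <- cut(x_raw,
                 breaks = c(100, 103, 106, 109, 112, 115),
                 labels = c("100-102", "103-105", "106-108", "109-111", "112-114"),
                 right = FALSE)

as.character(x_grouped)
```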
To convert the columns of an R data frame into rows, we can use the transpose function t(). For example, if we have a data frame df with five columns and five rows, we can convert its columns into rows with as.data.frame(t(df)).
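A small sketch of that transpose, using a made-up data frame called wide. Keep in mind that t() goes through a matrix, so a data frame with mixed column types would be coerced to a single type.

```r
# Hypothetical data frame with three numeric columns and two rows.
wide <- data.frame(a = 1:2, b = 3:4, c = 5:6)

# t() returns a matrix; as.data.frame() turns it back into a data frame.
flipped <- as.data.frame(t(wide))

dim(flipped)       # the former columns are now rows
rownames(flipped)  # the old column names become row names
```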
A median split is one method for turning a continuous variable into a categorical one. Essentially, the idea is to find the median of the continuous variable. Any value below the median is put in the category “Low” and every value above it is labeled “High.”
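A minimal sketch of a median split on the y values from the question. The description above does not say what happens to values exactly equal to the median, so assigning them to "High" here is an assumption.

```r
y <- c(10, 11, 12, 13, 14)
med <- median(y)

# Values below the median become "Low"; everything else becomes "High".
# Putting values equal to the median in "High" is an arbitrary choice.
y_group <- factor(ifelse(y < med, "Low", "High"), levels = c("Low", "High"))

table(y_group)
```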
The gather() function (from the tidyr package) reformats the data so that common attributes are gathered together as a single variable: it takes multiple columns and collapses them into key-value pairs, duplicating all other columns as needed.
A data.table solution. This should be really fast on large data.frames as well.
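library(data.table)
dt <- data.table(df, key = "y")
# For each y, expand the "lower-upper" label in x into the integer sequence lower:upper
dt[, list(x = seq(as.numeric(sub("-.*$", "", x)), as.numeric(sub(".*-", "", x)))), by = y]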
If you have more columns and you don't want every combination when splitting by column x, then this is the code to use:
library(data.table)
dt <- data.table(df)
# get all column names except "x"
key.cols <- setdiff(names(df), "x")
# use key.cols as the key of the data.table
setkeyv(dt, key.cols)
# expand each "lower-upper" label into its integer sequence, grouped by all other columns
dt.out <- dt[, list(x = seq(as.numeric(sub("-.*$", "", x)), as.numeric(sub(".*-", "", x)))), by = key.cols]
This should give you what you expect.
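library(stringr)
library(foreach)
foreach(i = seq_len(nrow(df)), .combine = rbind) %do% {
  # pull the lower and upper bounds out of the "lower-upper" label
  s <- as.numeric(str_extract_all(as.character(df$x[i]), "[0-9]+")[[1]])
  # repeat y once for every integer in the range
  data.frame(y = rep(df$y[i], s[2] - s[1] + 1), x = seq(s[1], s[2]))
}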
If your data.frame is really big, you can use %dopar% instead of %do%, after registering a parallel backend (for example with the doParallel package).
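For comparison, the same expansion can be sketched in base R with no extra packages. This is only one possible approach, and it assumes every x label has the form "lower-upper" with integer bounds. The names bounds, lo, hi, x_seq and expanded are illustrative.

```r
# Same example data as in the question.
df <- data.frame(y = c(10, 11, 12, 13, 14),
                 x = as.factor(c("100-102", "103-105", "106-108", "109-111", "112-114")))

# Split each "lower-upper" label into its two numeric bounds.
bounds <- strsplit(as.character(df$x), "-", fixed = TRUE)
lo <- as.numeric(sapply(bounds, `[`, 1))
hi <- as.numeric(sapply(bounds, `[`, 2))

# Build one integer sequence per original row.
x_seq <- Map(seq, lo, hi)

# Repeat each y once per expanded x value and flatten the sequences.
expanded <- data.frame(y = rep(df$y, lengths(x_seq)),
                       x = unlist(x_seq))

head(expanded)
```

The expanded data frame can then be passed to lm(y ~ x, data = expanded). Note that this weights each group by the width of its interval, which may or may not be what you want.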