Converting a grouped continuous variable into rows in R

I have a data frame with these dummy values, and I want to run an lm regression on them. One of the variables is a grouped continuous variable, as shown below:

df <- data.frame("y" = c(10, 11, 12, 13, 14),
                 "x" = as.factor(c("100-102", "103-105", "106-108", "109-111", "112-114")))

I want to regress y ~ x. One way is to replace the x factor levels with the numeric mean of each range, which is easy to do with a regular expression.
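A minimal sketch of that midpoint approach, assuming every label has the form "lo-hi" with non-negative integer bounds. The column name x_mid is hypothetical, introduced only for this illustration.

```r
df <- data.frame(y = c(10, 11, 12, 13, 14),
                 x = factor(c("100-102", "103-105", "106-108", "109-111", "112-114")))

# Pull the lower and upper bound out of each "lo-hi" label
labels <- as.character(df$x)
lo <- as.numeric(sub("-.*$", "", labels))
hi <- as.numeric(sub("^.*-", "", labels))

# Midpoint of each range (x_mid is a hypothetical column name)
df$x_mid <- (lo + hi) / 2

# Regress on the numeric midpoints instead of the factor
fit <- lm(y ~ x_mid, data = df)
coef(fit)
```

For the sample data the midpoints are 101, 104, 107, 110 and 113, so the model gets one numeric predictor instead of a five-level factor.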

Another way is to create additional rows and expand the dataset so that it looks like this:

data.frame("y" = c(10, 10, 10, 11, 11, 11......),
           "x" = c(100, 101, 102, 103, 104, 105......))

Is there a function that will do this?

I'm thinking of first creating additional variables like x1, x2, x3 and then using the reshape2 package to convert the x columns to rows.
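A rough sketch of that idea, assuming every range spans exactly three consecutive integers (true for the sample data, but not in general). The helper columns x1, x2, x3 and the names wide and long are placeholders for this illustration.

```r
library(reshape2)

df <- data.frame(y = c(10, 11, 12, 13, 14),
                 x = factor(c("100-102", "103-105", "106-108", "109-111", "112-114")))

# Lower bound of each range
lo <- as.integer(sub("-.*$", "", as.character(df$x)))

# One helper column per value in the range (assumes a width of exactly 3)
wide <- data.frame(y = df$y, x1 = lo, x2 = lo + 1L, x3 = lo + 2L)

# melt() stacks x1..x3 into a single column, repeating y for each value
long <- melt(wide, id.vars = "y", measure.vars = c("x1", "x2", "x3"),
             value.name = "x")

# Drop the helper "variable" column and sort for readability
long <- long[order(long$y, long$x), c("y", "x")]
rownames(long) <- NULL
```

This only works when all ranges have the same width; ranges of varying width need a per-row sequence instead.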

MySchizoBuddy asked Feb 09 '13 22:02




2 Answers

A data.table solution. This should also be fast on large data frames.

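library(data.table)
dt <- data.table(df, key = "y")
# For each y, expand the "lo-hi" label into the integer sequence lo, lo+1, ..., hi
dt[, list(x = seq(as.integer(sub("-.*$", "", x)), as.integer(sub(".*-", "", x)))), by = y]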

If you have more columns and want to carry them along while expanding x, group by every column except x:

library(data.table)
dt <- data.table(df)
# get all column names except "x"
key.cols <- setdiff(names(df), "x")
# key the data.table on those columns
setkeyv(dt, key.cols)
# expand each "lo-hi" label into its integer sequence, one group per combination of key.cols
dt.out <- dt[, list(x = seq(as.integer(sub("-.*$", "", x)), as.integer(sub(".*-", "", x)))), by = key.cols]

This should give you what you expect.

Arun answered Sep 30 '22 10:09



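library(stringr)
library(foreach)

# Build one small data.frame per input row and stack them with rbind
foreach(i = seq_len(nrow(df)), .combine = rbind) %do% {
  # s[1] is the lower bound and s[2] the upper bound of the "lo-hi" label
  s <- as.numeric(str_extract_all(as.character(df$x[i]), "[0-9]+")[[1]])
  data.frame(y = rep(df$y[i], s[2] - s[1] + 1), x = seq(s[1], s[2]))
}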

If your data frame is really big, you can replace %do% with %dopar% to run the loop in parallel. This requires a registered parallel backend.
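A sketch of the parallel variant, assuming the doParallel package is used as the backend and that two cores are available. The backend choice, the core count, and the name res are assumptions made for this illustration.

```r
library(foreach)
library(doParallel)
library(stringr)

df <- data.frame(y = c(10, 11, 12, 13, 14),
                 x = factor(c("100-102", "103-105", "106-108", "109-111", "112-114")))

# Register a parallel backend (2 cores is an arbitrary choice)
registerDoParallel(cores = 2)

# Same loop body as the sequential version; .packages makes stringr
# available on the workers
res <- foreach(i = seq_len(nrow(df)), .combine = rbind,
               .packages = "stringr") %dopar% {
  s <- as.numeric(str_extract_all(as.character(df$x[i]), "[0-9]+")[[1]])
  data.frame(y = rep(df$y[i], s[2] - s[1] + 1), x = seq(s[1], s[2]))
}

# Release the workers created by registerDoParallel()
stopImplicitCluster()
```

For a data frame this small the parallel overhead will likely outweigh any gain; it is only worth trying on much larger inputs.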

redmode answered Sep 30 '22 09:09
