
Linear model and dplyr - a better solution?

Tags: r, dplyr

I got a lot of good feedback on a question I recently asked and was guided to use dplyr to transform some data. I'm having an issue with lm() and trying to find a slope from this transformed data and thought I'd open up a new question.

First I have data that looks like this:

Var1    Var2    Var3    Time         Temp
a       w       j       9/9/2014     20
a       w       j       9/9/2014     15
a       w       k       9/20/2014    10
a       w       j       9/10/2014    0
b       x       L       9/12/2014    30
b       x       L       9/12/2014    10
b       y       k       9/13/2014    20
b       y       k       9/13/2014    15
c       z       j       9/14/2014    20
c       z       j       9/14/2014    10
c       z       k       9/14/2014    11
c       w       l       9/10/2014    45
a       d       j       9/22/2014    20
a       d       k       9/15/2014    4
a       d       l       9/15/2014    23
a       d       k       9/15/2014    11

And I want it in the form of this (values for Slope and Pearson simulated for illustration):

V1  V2  V3  Slope   Pearson
a   w   j   -3      -0.9
a   w   k   2       0
a   d   j   1.5     0.6
a   d   k   0       0.5
a   d   l   -0.5    -0.6
b   x   L   12      0.7
b   y   k   4       0.6
c   z   j   -1      -0.5
c   z   k   -3      -0.4
c   w   l   -10     -0.9

The slope is a linear least-squares slope. In theory, the script would look like this:

library(dplyr)

data <- read.table("clipboard",sep="\t",quote="",header=T)

newdata = summarise(group_by(data
                              ,Var1
                              ,Var2
                              ,Var3                            
                              )
                     ,Slope = lm(Temp ~ Time)$coeff[2]                 
                     ,Pearson = cor(Time, Temp, method="pearson")
                     )

But R throws an error, as if it can't find Time or Temp. Running lm(data$Temp ~ data$Time)$coeff[2] works, but it returns the slope for the entire data set rather than for each subset. cor() seems to run just fine in the group_by section, so is there a specific syntax I need to pass to lm() to make it behave the same way, or should I use a different function entirely to get a slope for each subset?

asked Nov 05 '14 19:11 by AI52487963


1 Answer

You have several issues here.

  1. If you group your data by three variables (or even two), most groups don't have enough distinct values to fit a linear regression model.
  2. Pearson correlation requires two numeric variables, while Time is a factor here, and converting the factor directly to numeric (its level codes) won't make much sense.
  3. You will need to use do() in order to run your linear model within each group.
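The first two points can be checked directly on the sample data. The following is a minimal sketch, assuming the question's table is re-entered inline with read.table(text = ...) instead of read from the clipboard, and assuming dplyr 1.0 or later (for the .groups argument). The names group_sizes, n_obs and n_times are introduced here for illustration only.

```r
library(dplyr)

# Sample data from the question, entered inline (assumption: whitespace-separated)
data <- read.table(text = "
Var1 Var2 Var3 Time Temp
a w j 9/9/2014 20
a w j 9/9/2014 15
a w k 9/20/2014 10
a w j 9/10/2014 0
b x L 9/12/2014 30
b x L 9/12/2014 10
b y k 9/13/2014 20
b y k 9/13/2014 15
c z j 9/14/2014 20
c z j 9/14/2014 10
c z k 9/14/2014 11
c w l 9/10/2014 45
a d j 9/22/2014 20
a d k 9/15/2014 4
a d l 9/15/2014 23
a d k 9/15/2014 11
", header = TRUE, stringsAsFactors = FALSE)

# Count rows and distinct dates per (Var1, Var2, Var3) group
group_sizes <- data %>%
  group_by(Var1, Var2, Var3) %>%
  summarise(n_obs = n(), n_times = n_distinct(Time), .groups = "drop")

print(group_sizes)

# Time is not numeric (character here, factor under older read.table defaults)
print(class(data$Time))
```

A slope needs at least two distinct Time values within a group, so any group with n_times below 2 cannot produce one.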

Here's an illustration for grouping only on Var1:

data %>%
  group_by(Var1) %>% # You can add here additional grouping variables if your real data set enables it
  do(mod = lm(Temp ~ Time, data = .)) %>%
  mutate(Slope = summary(mod)$coeff[2]) %>%
  select(-mod)
# Source: local data frame [3 x 2]
# Groups: <by row>
#   
#   Var1     Slope
# 1    a  12.66667
# 2    b  -2.50000
# 3    c -31.33333 
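One caveat about the Slope column above: because Time is a factor (or character) here, lm(Temp ~ Time) fits one coefficient per date level, so coeff[2] is the difference in mean Temp between the second and the first date in each group, not a rate of change per unit of time. If a per-day slope is what you are after, one option is to convert Time to a numeric day count first. The following is a minimal sketch, assuming the dates are in month/day/year format and a recent dplyr (1.0 or later) where summarise() can evaluate lm() on the grouped columns. Day and slopes are helper names introduced here for illustration.

```r
library(dplyr)

# Sample data from the question, entered inline (assumption: whitespace-separated)
data <- read.table(text = "
Var1 Var2 Var3 Time Temp
a w j 9/9/2014 20
a w j 9/9/2014 15
a w k 9/20/2014 10
a w j 9/10/2014 0
b x L 9/12/2014 30
b x L 9/12/2014 10
b y k 9/13/2014 20
b y k 9/13/2014 15
c z j 9/14/2014 20
c z j 9/14/2014 10
c z k 9/14/2014 11
c w l 9/10/2014 45
a d j 9/22/2014 20
a d k 9/15/2014 4
a d l 9/15/2014 23
a d k 9/15/2014 11
", header = TRUE, stringsAsFactors = FALSE)

slopes <- data %>%
  # Assumption: Time is month/day/year; Day is the number of days since 1970-01-01
  mutate(Day = as.numeric(as.Date(Time, format = "%m/%d/%Y"))) %>%
  group_by(Var1) %>%
  summarise(Slope   = unname(coef(lm(Temp ~ Day))[2]),   # change in Temp per day
            Pearson = cor(Day, Temp, method = "pearson"))

print(slopes)
```

With Day numeric, cor() also becomes meaningful on the real columns, so no dummy variables are needed for the Pearson coefficient.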

If you do have two numeric variables, you can use do() to calculate the correlation too. For example (I will create some dummy numeric variables for illustration):

data %>%
  mutate(test1 = sample(1:3, n(), replace = TRUE), # Creating some numeric variables
         test2 = sample(1:3, n(), replace = TRUE)) %>%
  group_by(Var1) %>%
  do(mod = lm(Temp ~ Time, data = .),
     mod2 = cor(.$test1, .$test2, method = "pearson")) %>%
  mutate(Slope = summary(mod)$coeff[2],
         Pearson = mod2[1]) %>%
  select(-mod, -mod2)


# Source: local data frame [3 x 3]
# Groups: <by row>
#   
#   Var1     Slope     Pearson
# 1    a  12.66667  0.25264558
# 2    b  -2.50000 -0.09090909
# 3    c -31.33333  0.30151134

Bonus solution: you can do this quite efficiently and easily with the data.table package too:

library(data.table)
setDT(data)[, list(Slope = summary(lm(Temp ~ Time))$coeff[2]), Var1]
#    Var1     Slope
# 1:    a  12.66667
# 2:    b  -2.50000
# 3:    c -31.33333
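If Time is first converted to a numeric day count, the same data.table pattern gives a slope per day rather than a coefficient for a date level. The following is a minimal sketch, assuming month/day/year dates and the question's data entered inline. dt, Day and res are names introduced here for illustration.

```r
library(data.table)

# Sample data from the question, entered inline (assumption: whitespace-separated)
dt <- as.data.table(read.table(text = "
Var1 Var2 Var3 Time Temp
a w j 9/9/2014 20
a w j 9/9/2014 15
a w k 9/20/2014 10
a w j 9/10/2014 0
b x L 9/12/2014 30
b x L 9/12/2014 10
b y k 9/13/2014 20
b y k 9/13/2014 15
c z j 9/14/2014 20
c z j 9/14/2014 10
c z k 9/14/2014 11
c w l 9/10/2014 45
a d j 9/22/2014 20
a d k 9/15/2014 4
a d l 9/15/2014 23
a d k 9/15/2014 11
", header = TRUE, stringsAsFactors = FALSE))

# Assumption: Time is month/day/year; Day is the number of days since 1970-01-01
dt[, Day := as.numeric(as.Date(Time, format = "%m/%d/%Y"))]

# One row per Var1 with the per-day slope and the Pearson correlation
res <- dt[, .(Slope   = unname(coef(lm(Temp ~ Day))[2]),
              Pearson = cor(Day, Temp, method = "pearson")),
          by = Var1]

print(res)
```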

Or, if we want to create some dummy variables too:

library(data.table)
setDT(data)[, `:=`(test1 = sample(1:3, .N, replace = TRUE), 
                   test2 = sample(1:3, .N, replace = TRUE))][, 
                   list(Slope = summary(lm(Temp ~ Time))$coeff[2],
                        Pearson = cor(test1, test2, method = "pearson")), Var1]
#    Var1     Slope     Pearson
# 1:    a  12.66667 -0.02159168
# 2:    b  -2.50000 -0.81649658
# 3:    c -31.33333 -1.00000000
answered Sep 25 '22 03:09 by David Arenburg