I got a lot of good feedback on a question I recently asked and was guided to use dplyr to transform some data. I'm now having trouble using lm() to find a slope from this transformed data, so I thought I'd open a new question.
First, I have data that looks like this:
Var1 Var2 Var3 Time Temp
a w j 9/9/2014 20
a w j 9/9/2014 15
a w k 9/20/2014 10
a w j 9/10/2014 0
b x L 9/12/2014 30
b x L 9/12/2014 10
b y k 9/13/2014 20
b y k 9/13/2014 15
c z j 9/14/2014 20
c z j 9/14/2014 10
c z k 9/14/2014 11
c w l 9/10/2014 45
a d j 9/22/2014 20
a d k 9/15/2014 4
a d l 9/15/2014 23
a d k 9/15/2014 11
And I want it in this form (the Slope and Pearson values are made up for illustration):
Var1 Var2 Var3 Slope Pearson
a w j -3 -0.9
a w k 2 0
a d j 1.5 0.6
a d k 0 0.5
a d l -0.5 -0.6
b x L 12 0.7
b y k 4 0.6
c z j -1 -0.5
c z k -3 -0.4
c w l -10 -0.9
The slope is the linear least-squares slope. In theory, the script would look like this:
library(dplyr)
data <- read.table("clipboard", sep = "\t", quote = "", header = TRUE)
newdata <- summarise(group_by(data, Var1, Var2, Var3),
                     Slope = lm(Temp ~ Time)$coeff[2],
                     Pearson = cor(Time, Temp, method = "pearson"))
But R throws an error as if it can't find Time or Temp. It can run lm(data$Temp ~ data$Time)$coeff[2], but that returns the slope for the entire data set and not for each subset, which is what I'm looking for. cor() seems to run just fine in the group_by section, so is there a specific syntax I need to pass to lm() to have it run in a similar manner, or should I use a different function entirely to get a slope for each subset?
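You have several issues here. First, if you group your data by three variables (or even two), you don't have enough distinct values in each group to run a linear regression model. Second, Pearson correlation requires two numeric variables, while Time is a factor, and converting it to numeric won't make much sense. Third, you will need to use do in order to run your linear model.
Here's an illustration for grouping only on Var1: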
data %>%
  group_by(Var1) %>% # You can add additional grouping variables here if your real data set allows it
  do(mod = lm(Temp ~ Time, data = .)) %>%
  mutate(Slope = summary(mod)$coeff[2]) %>%
  select(-mod)
# Source: local data frame [3 x 2]
# Groups: <by row>
#
# Var1 Slope
# 1 a 12.66667
# 2 b -2.50000
# 3 c -31.33333
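To see the first issue concretely, you can count how many distinct Time values each group has. This is only a sketch in base R: df is a hand-typed copy of the first eight rows of the question's data, and n_times, DistinctTimes and too_small are names I made up for illustration.

```r
# Hand-typed copy of the first eight rows of the question's data (illustrative only).
df <- data.frame(
  Var1 = c("a", "a", "a", "a", "b", "b", "b", "b"),
  Var2 = c("w", "w", "w", "w", "x", "x", "y", "y"),
  Var3 = c("j", "j", "k", "j", "L", "L", "k", "k"),
  Time = c("9/9/2014", "9/9/2014", "9/20/2014", "9/10/2014",
           "9/12/2014", "9/12/2014", "9/13/2014", "9/13/2014"),
  Temp = c(20, 15, 10, 0, 30, 10, 20, 15)
)

# Count the distinct Time values inside every Var1/Var2/Var3 group.
n_times <- aggregate(df$Time,
                     by = df[c("Var1", "Var2", "Var3")],
                     FUN = function(x) length(unique(x)))
names(n_times)[names(n_times) == "x"] <- "DistinctTimes"

# A slope needs at least two distinct time points, so flag the groups that fall short.
too_small <- n_times[n_times$DistinctTimes < 2, ]

print(n_times)
print(too_small)
```

Any group that shows up in too_small has nothing to fit a line through, which is why the examples here group on Var1 only.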
If you do have two numeric variables, you can use do in order to calculate correlation too, for example (I will create some dummy numeric variables for illustration):
data %>%
  mutate(test1 = sample(1:3, n(), replace = TRUE), # Creating some numeric variables
         test2 = sample(1:3, n(), replace = TRUE)) %>%
  group_by(Var1) %>%
  do(mod = lm(Temp ~ Time, data = .),
     mod2 = cor(.$test1, .$test2, method = "pearson")) %>%
  mutate(Slope = summary(mod)$coeff[2],
         Pearson = mod2[1]) %>%
  select(-mod, -mod2)
# Source: local data frame [3 x 3]
# Groups: <by row>
#
# Var1 Slope Pearson
# 1 a 12.66667 0.25264558
# 2 b -2.50000 -0.09090909
# 3 c -31.33333 0.30151134
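If Time is parsed as a real date first, a per-group slope (change in Temp per day) and a Pearson correlation against Temp both become meaningful. The following is a sketch under assumptions rather than a replacement for the code above: it assumes a recent dplyr (1.0 or later, where do is marked as superseded), assumes the dates are month/day/year, and uses a hand-typed copy of the first eight rows of the question's data. Date, Day and slopes are illustrative names.

```r
library(dplyr)

# Hand-typed copy of the first eight rows of the question's data (illustrative only).
df <- data.frame(
  Var1 = c("a", "a", "a", "a", "b", "b", "b", "b"),
  Time = c("9/9/2014", "9/9/2014", "9/20/2014", "9/10/2014",
           "9/12/2014", "9/12/2014", "9/13/2014", "9/13/2014"),
  Temp = c(20, 15, 10, 0, 30, 10, 20, 15)
)

slopes <- df %>%
  # Parse the month/day/year strings, then express each date as days since the earliest one.
  mutate(Date = as.Date(Time, format = "%m/%d/%Y"),
         Day  = as.numeric(Date - min(Date))) %>%
  group_by(Var1) %>%
  # One row per group: least-squares slope of Temp on Day, plus their Pearson correlation.
  summarise(Slope   = unname(coef(lm(Temp ~ Day))[2]),
            Pearson = cor(Day, Temp, method = "pearson"))

print(slopes)
```

Slope here is in Temp units per day, and unname() only drops the coefficient's name so the column stays a plain numeric. For a single predictor the same slope can be cross-checked as cov(Day, Temp) / var(Day).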
Bonus solution: you can do this quite efficiently/easily with the data.table package too
library(data.table)
setDT(data)[, list(Slope = summary(lm(Temp ~ Time))$coeff[2]), Var1]
# Var1 Slope
# 1: a 12.66667
# 2: b -2.50000
# 3: c -31.33333
Or if we want to create some dummy variables too
library(data.table)
setDT(data)[, `:=`(test1 = sample(1:3, .N, replace = TRUE),
                   test2 = sample(1:3, .N, replace = TRUE))][,
  list(Slope = summary(lm(Temp ~ Time))$coeff[2],
       Pearson = cor(test1, test2, method = "pearson")), Var1]
# Var1 Slope Pearson
# 1: a 12.66667 -0.02159168
# 2: b -2.50000 -0.81649658
# 3: c -31.33333 -1.00000000
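The data.table version can also work on real dates instead of a factor. Again, this is only a sketch: it assumes month/day/year dates and a hand-typed copy of the first eight rows of the question's data, and dt, Day and res are illustrative names.

```r
library(data.table)

# Hand-typed copy of the first eight rows of the question's data (illustrative only).
dt <- data.table(
  Var1 = c("a", "a", "a", "a", "b", "b", "b", "b"),
  Time = c("9/9/2014", "9/9/2014", "9/20/2014", "9/10/2014",
           "9/12/2014", "9/12/2014", "9/13/2014", "9/13/2014"),
  Temp = c(20, 15, 10, 0, 30, 10, 20, 15)
)

# Convert the date strings to a day count (days since 1970-01-01).
# The origin does not affect the slope or the correlation.
dt[, Day := as.numeric(as.Date(Time, format = "%m/%d/%Y"))]

# Per-group least-squares slope and Pearson correlation.
res <- dt[, .(Slope   = unname(coef(lm(Temp ~ Day))[2]),
              Pearson = cor(Day, Temp, method = "pearson")),
          by = Var1]

print(res)
```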