Linear model and dplyr - a better solution?

Tags:

r

dplyr

I got a lot of good feedback on a question I recently asked and was guided to use dplyr to transform some data. I'm having an issue with lm() and trying to find a slope from this transformed data and thought I'd open up a new question.

First I have data that looks like this:

Var1    Var2    Var3    Time           Temp
a       w       j       9/9/2014       20
a       w       j       9/9/2014       15
a       w       k       9/20/2014      10
a       w       j       9/10/2014       0
b       x       L       9/12/2014       30
b       x       L       9/12/2014       10
b       y       k       9/13/2014       20
b       y       k       9/13/2014       15
c       z       j       9/14/2014       20
c       z       j       9/14/2014       10
c       z       k       9/14/2014       11
c       w       l       9/10/2014       45
a       d       j       9/22/2014       20
a       d       k       9/15/2014       4
a       d       l       9/15/2014       23
a       d       k       9/15/2014       11

And I want it in this form (values for Slope and Pearson simulated for illustration):

V1  V2  V3  Slope   Pearson
a   w   j   -3      -0.9
a   w   k   2       0
a   d   j   1.5     0.6
a   d   k   0       0.5
a   d   l   -0.5    -0.6
b   x   L   12      0.7
b   y   k   4       0.6
c   z   j   -1      -0.5
c   z   k   -3      -0.4
c   w   l   -10     -0.9

The slope being a linear-least-squares slope. In theory, the script would look like so:

library(dplyr)

data <- read.table("clipboard",sep="\t",quote="",header=T)

newdata = summarise(group_by(data
                              ,Var1
                              ,Var2
                              ,Var3                            
                              )
                     ,Slope = lm(Temp ~ Time)$coeff[2]                 
                     ,Pearson = cor(Time, Temp, method="pearson")
                     )

But R throws an error like it can't find Time or Temp. It can run lm(data$Temp ~ data$Time)$coeff[2], but returns the slope for the entire data set and not the subsetted form that I'm looking for. cor() seems to run just fine in the group_by section, so is there a specific syntax I need to pass to lm() to have it run in a similar manner or use a different function entirely to get a slope passed from the subset?

368

asked Nov 05 '14 19:11

AI52487963

1 Answer

You have several issues here.

1. If you group your data by three variables (or even two), most groups don't have enough distinct values to fit a linear regression model.
2. Pearson correlation requires two numeric variables, but Time is a factor, and converting the factor directly to numeric wouldn't make much sense.
3. You will need to use do in order to run your linear model within each group.
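
To see the first two points concretely, here is a minimal sketch in base R. It retypes the question's data by hand and assumes the dates are in month/day/year format; the Date and Day columns and the names groups, n_rows and n_days are my own additions for illustration, not part of the original data.

```r
# Data from the question, retyped by hand (Var3 case kept as posted)
data <- data.frame(
  Var1 = c("a","a","a","a","b","b","b","b","c","c","c","c","a","a","a","a"),
  Var2 = c("w","w","w","w","x","x","y","y","z","z","z","w","d","d","d","d"),
  Var3 = c("j","j","k","j","L","L","k","k","j","j","k","l","j","k","l","k"),
  Time = c("9/9/2014","9/9/2014","9/20/2014","9/10/2014",
           "9/12/2014","9/12/2014","9/13/2014","9/13/2014",
           "9/14/2014","9/14/2014","9/14/2014","9/10/2014",
           "9/22/2014","9/15/2014","9/15/2014","9/15/2014"),
  Temp = c(20, 15, 10, 0, 30, 10, 20, 15, 20, 10, 11, 45, 20, 4, 23, 11),
  stringsAsFactors = FALSE
)

# Assumption: Time is month/day/year. Parse it, then count days from the
# earliest observation so that it becomes a genuinely numeric variable.
data$Date <- as.Date(data$Time, format = "%m/%d/%Y")
data$Day  <- as.numeric(data$Date - min(data$Date))

# Split the day values by the three grouping variables
groups <- split(data$Day, list(data$Var1, data$Var2, data$Var3), drop = TRUE)

# Rows and distinct days available in each group
n_rows <- sapply(groups, length)
n_days <- sapply(groups, function(d) length(unique(d)))
print(data.frame(n_rows, n_days))
```

A group needs at least two distinct days before a slope is even defined, so the n_days column shows which groups can support a regression at all.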

Here's an illustration for grouping only on Var1:

data %>%
  group_by(Var1) %>% # You can add here additional grouping variables if your real data set enables it
  do(mod = lm(Temp ~ Time, data = .)) %>%
  mutate(Slope = summary(mod)$coeff[2]) %>%
  select(-mod)
# Source: local data frame [3 x 2]
# Groups: <by row>
#   
#   Var1     Slope
# 1    a  12.66667
# 2    b  -2.50000
# 3    c -31.33333
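
One caveat about the numbers above: with Time stored as a factor, lm() fits one coefficient per date level, so the second coefficient is a difference in mean Temp between two dates rather than a change per day. If a per-day slope is what you are after, a rough sketch could look like the following. It assumes the dates are month/day/year and uses the closed-form least-squares slope cov(x, y) / var(x), which avoids calling lm() inside summarise altogether; the Day column is my own addition.

```r
library(dplyr)

# Only the columns needed here, retyped by hand from the question
data <- data.frame(
  Var1 = c("a","a","a","a","b","b","b","b","c","c","c","c","a","a","a","a"),
  Time = c("9/9/2014","9/9/2014","9/20/2014","9/10/2014",
           "9/12/2014","9/12/2014","9/13/2014","9/13/2014",
           "9/14/2014","9/14/2014","9/14/2014","9/10/2014",
           "9/22/2014","9/15/2014","9/15/2014","9/15/2014"),
  Temp = c(20, 15, 10, 0, 30, 10, 20, 15, 20, 10, 11, 45, 20, 4, 23, 11),
  stringsAsFactors = FALSE
)

slopes <- data %>%
  # Assumption: month/day/year dates; as.numeric() of a Date is a day count
  mutate(Day = as.numeric(as.Date(Time, format = "%m/%d/%Y"))) %>%
  group_by(Var1) %>%
  summarise(
    # Simple linear regression slope of Temp on Day, in Temp units per day
    Slope   = cov(Day, Temp) / var(Day),
    # Pearson correlation is now well defined because Day is numeric
    Pearson = cor(Day, Temp, method = "pearson")
  )

print(slopes)
```

The slope from this formula is the same value that coef(lm(Temp ~ Day))[2] would return for each group; it is only spelled out so that it works on plain vectors.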

If you do have two numeric variables, you can use do to calculate the correlation too. For example (I will create some dummy numeric variables for illustration):

data %>%
  mutate(test1 = sample(1:3, n(), replace = TRUE), # Creating some numeric variables
         test2 = sample(1:3, n(), replace = TRUE)) %>%
  group_by(Var1) %>%
  do(mod = lm(Temp ~ Time, data = .),
     mod2 = cor(.$test1, .$test2, method = "pearson")) %>%
  mutate(Slope = summary(mod)$coeff[2],
         Pearson = mod2[1]) %>%
  select(-mod, -mod2)


# Source: local data frame [3 x 3]
# Groups: <by row>
#   
#   Var1     Slope     Pearson
# 1    a  12.66667  0.25264558
# 2    b  -2.50000 -0.09090909
# 3    c -31.33333  0.30151134

Bonus solution: you can do this quite efficiently and easily with the data.table package too:

library(data.table)
setDT(data)[, list(Slope = summary(lm(Temp ~ Time))$coeff[2]), Var1]
#    Var1     Slope
# 1:    a  12.66667
# 2:    b  -2.50000
# 3:    c -31.33333

Or, if we want to create some dummy variables too:

library(data.table)
setDT(data)[, `:=`(test1 = sample(1:3, .N, replace = TRUE), 
                   test2 = sample(1:3, .N, replace = TRUE))][, 
                   list(Slope = summary(lm(Temp ~ Time))$coeff[2],
                        Pearson = cor(test1, test2, method = "pearson")), Var1]
#    Var1     Slope     Pearson
# 1:    a  12.66667 -0.02159168
# 2:    b  -2.50000 -0.81649658
# 3:    c -31.33333 -1.00000000
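
If you still want one row per Var1/Var2/Var3 combination, as in the question, one possible approach is to return NA for groups that cannot support a regression. The sketch below assumes month/day/year dates; safe_slope is a hypothetical helper of my own, not a data.table or base R function.

```r
library(data.table)

# Data from the question, retyped by hand (Var3 case kept as posted)
dt <- data.table(
  Var1 = c("a","a","a","a","b","b","b","b","c","c","c","c","a","a","a","a"),
  Var2 = c("w","w","w","w","x","x","y","y","z","z","z","w","d","d","d","d"),
  Var3 = c("j","j","k","j","L","L","k","k","j","j","k","l","j","k","l","k"),
  Time = c("9/9/2014","9/9/2014","9/20/2014","9/10/2014",
           "9/12/2014","9/12/2014","9/13/2014","9/13/2014",
           "9/14/2014","9/14/2014","9/14/2014","9/10/2014",
           "9/22/2014","9/15/2014","9/15/2014","9/15/2014"),
  Temp = c(20, 15, 10, 0, 30, 10, 20, 15, 20, 10, 11, 45, 20, 4, 23, 11)
)

# Assumption: month/day/year dates, converted to a numeric day count
dt[, Day := as.numeric(as.Date(Time, format = "%m/%d/%Y"))]

# Hypothetical helper: slope of y on x, or NA when x has fewer than
# two distinct values and a slope is therefore undefined
safe_slope <- function(x, y) {
  if (length(unique(x)) < 2) return(NA_real_)
  unname(coef(lm(y ~ x))[2])
}

res <- dt[, .(N = .N, Slope = safe_slope(Day, Temp)), by = .(Var1, Var2, Var3)]
print(res)
```

With the sample data, most groups come back as NA, which is the first issue listed at the top of this answer showing up in practice.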

107

answered Sep 25 '22 03:09

David Arenburg
