Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Trying to use aggregate to run linear model on subset of values in a column

I have roughly 7,500 subsidence values. Each subsidence value (V9) corresponds to a station (V2) and a year (V3). I want a line of best fit (V9~V3) for each station.

I created a function using lm that works when I manually subset the data. However, when I try to use aggregate to run a linear model on each station, I get the same value for every station.

Here's an example of what some of the data looks like:

V2        V3    V9
CRMS0002  2008  -28.4990000
CRMS0002  2009  -28.8080808
CRMS0002  2012  -31.9871795
CRMS0006  2008  -56.8998413
CRMS0006  2013   40.8611111
CRMS0006  2015   32.8555555
CRMS0033  2007  -16.8044444

This is the code:

sub_rate = function(x) {lm(CRMSsub$V9~CRMSsub$V3)}
agg <- aggregate(CRMSsub$V9, by = list(CRMSsub$V2), FUN = sub_rate)

I also tried:

agg <- lapply(split(CRMSsub, CRMSsub$V3), FUN = sub_rate)

The aggregate by part of the first and second code works. So I get 354 elements that are organized by station, but the linear model results which give me intercept and slope are the same for every station, which means it's not performing my function by station. Here's an example of the result:

Group.1         x
CRMS0002        c(`(Intercept)` = -2333.06378840009, `CRMSsub$V3` = 1.1541441797906)
CRMS0006        c(`(Intercept)` = -2333.06378840009, `CRMSsub$V3` = 1.1541441797906)
CRMS0033        c(`(Intercept)` = -2333.06378840009, `CRMSsub$V3` = 1.1541441797906)
like image 855
Catherine Fitzpatrick Avatar asked Sep 10 '19 14:09

Catherine Fitzpatrick


2 Answers

Try not to overcomplicate it. Just split on the variable you want to regress, and simply apply lm on the right columns

lapply(split(CRMSsub, CRMSsub$V2), function(i)lm(V9 ~ V3, i))

Which gives,

$CRMS0002

Call:
lm(formula = V9 ~ V3, data = i)

Coefficients:
(Intercept)           V3  
  1809.7832      -0.9153  


$CRMS0006

Call:
lm(formula = V9 ~ V3, data = i)

Coefficients:
(Intercept)           V3  
  -28396.65        14.12  


$CRMS0033

Call:
lm(formula = V9 ~ V3, data = i)

Coefficients:
(Intercept)           V3  
      -16.8           NA  
like image 81
Sotos Avatar answered Oct 19 '22 07:10

Sotos


The problem with your approach is that you specified your dataset in the call of sub_rate. You also need to specify the dataset as x in lapply(). For instance you can do:

library(dplyr)

sub_rate <- function(x){lm(x$V9~x$V3)}

lapply(CRMSsub %>% split(.$V2),sub_rate)
like image 34
alex_555 Avatar answered Oct 19 '22 09:10

alex_555