How to interpret R linear regression when there are multiple factor levels as the baseline? [closed]

My data has 3 independent variables, all of which are categorical:

condition: cond1, cond2, cond3

population: A,B,C

task: 1,2,3,4,5

The dependent variable is the task completion time. I run lm(time~condition+user+task,data) in R and get the following results:

[Image: screenshot of the summary() coefficient table; cond1, groupA, and task1 do not appear as rows.]

What confuses me is that cond1, groupA, and task1 are left out from the results. From the thread linear regression "NA" estimate just for last coefficient, I understand that one factor level is chosen as the "baseline" and shown in the (Intercept) row.

But what if there are multiple factor levels used as the baseline, as in the above case?

  • Does the (Intercept) row now indicate cond1+groupA+task1?
  • What if I want to know the coefficient and significance for cond1, groupA, and task1 individually?
  • For example, groupB has an estimated coefficient of +9.3349 — is that relative to groupA, or to cond1+groupA+task1?
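A minimal reproducible sketch of this setup (the data values and the data-frame name `d` are made up for illustration; the real data and output are in the screenshot above):

```r
# Simulate data with the same factor structure as in the question
set.seed(1)
d <- data.frame(
  condition  = factor(sample(c("cond1", "cond2", "cond3"), 100, replace = TRUE)),
  population = factor(sample(c("A", "B", "C"), 100, replace = TRUE)),
  task       = factor(sample(1:5, 100, replace = TRUE)),
  time       = rnorm(100, mean = 30, sd = 5)
)

fit <- lm(time ~ condition + population + task, data = d)
summary(fit)  # cond1, population A, and task 1 are absorbed into the (Intercept) row
```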
Ida asked Feb 10 '14




2 Answers

Every individual in your data has a value for each of the variables 'condition', 'population', and 'task', so the baseline individual must also have one level of each; in this case, cond1, A, and task 1. All coefficients are expressed relative to this baseline individual, so the intercept does give the mean value of time for cond1, groupA, and task1.

A separate coefficient or significance test for cond1, groupA, or task1 makes no sense, because significance here means a significantly different mean value between one group and the reference group. You cannot compare the reference group against itself.

As your model has no interactions, the coefficient for groupB means that the mean time for somebody in population B is 9.33 (seconds?) higher than for somebody in population A, regardless of the condition and task they are performing. Since the p-value is very small, you can conclude that the mean time really does differ between population B and the reference population (A). If you added an interaction term to the model, those terms (for example usergroupB:taskt4) would indicate the extra value added (or subtracted) to the mean time when an individual has both conditions (in this example, an individual from population B performing task 4). These effects would be added on top of the marginal ones (usergroupB and taskt4).
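One practical consequence: if you want the model to report comparisons against a different baseline, you can change the reference level with `relevel()` and refit. A sketch, assuming your data frame is called `data` with the factors named as in the question:

```r
# Make population B the baseline instead of A; all population
# coefficients are then differences from B (e.g. populationA would
# be roughly -9.33 given the output in the question)
data$population <- relevel(data$population, ref = "B")
fit <- lm(time ~ condition + population + task, data = data)
summary(fit)
```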

Hope I helped.

Rufo answered Nov 14 '22


Does the (Intercept) row now indicate cond1+groupA+task1?

Yes.

What if I want to know the coefficient and significance for cond1, groupA, and task1 individually?

Think about what significance means. You need to formulate a hypothesis. In your example everything is compared to the intercept and your question doesn't really make sense. However, you can always conduct pairwise comparisons between all possible effect combinations (see package multcomp).

For example, groupB has an estimated coefficient +9.3349, compared to groupA? Or compared to cond1+groupA+task1?

It's the difference between cond1/task1/groupA and cond1/task1/groupB. (As @Rufo correctly points out, it is of course an overall effect and actually the difference between groupB and groupA provided the other effects are equal.)
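You can check this "other effects held equal" reading directly with `predict()`. A sketch, assuming `fit` is the fitted model from the question's `lm()` call:

```r
# Predicted times for two individuals who differ only in population;
# because the model has no interactions, the gap is the same at every
# condition/task combination and equals the populationB coefficient
nd <- data.frame(condition = "cond1",
                 task = factor(1),
                 population = c("A", "B"))
diff(predict(fit, newdata = nd))
```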

Roland answered Nov 14 '22