
Why does plm not like my dplyr-created dataframe?

Tags: r, dplyr, plm

If I perform simple and seemingly identical operations using, in one case, base R, and in the other case, dplyr, on two pdata.frames and then model them with lm(), I get the exact same results, as expected. If I then pass those datasets to plm(), the estimated model parameters (as well as the panel structure) differ between the datasets. Why would this be happening?

The toy example here illustrates my issue. Two panel dataframes, df_base and df_dplyr, are generated from a single source, df. When passed through lm(), both dataframes yield the same result. When passed through plm(), however, it appears that the panel structure becomes altered (see counts of n and T), resulting in differing estimation results.

Using R 4.2.3 with dplyr 1.1.1.

set.seed(1)

library(dplyr)
library(magrittr)
library(plm)

# Make toy dataframe
A <- runif(100)
B <- runif(100)
C <- runif(100)
df <- data.frame(A,B,C)
df$id <- floor((as.numeric(rownames(df))-1)/10)
df$t <- ave(df$A, df$id, FUN = seq_along)

# Modify first copy of dataframe using base R
df_base <- pdata.frame(df, index = c('id','t')) 
df_base <- subset(df_base, (as.numeric(df_base$t)<8))

# Modify second copy of dataframe using dplyr
df_dplyr <- pdata.frame(df, index = c('id','t')) 
df_dplyr <- df_dplyr %>% 
  filter(as.numeric(t)<8) 

# Results are the same for lm()
print(summary(lm(A ~ B + C, data = df_base)))
print(summary(lm(A ~ B + C, data = df_dplyr)))

# Results differ for plm()
# Note: plm() selects the estimator via `model`, not `method`
print(summary(plm(A ~ B + C, data = df_base, model = "within")))
print(summary(plm(A ~ B + C, data = df_dplyr, model = "within")))
snowpeak asked Sep 14 '25 09:09


1 Answer

dplyr is not "pdata.frame-friendly". A pdata.frame carries an index attribute that enables panel operations. When rows are subset, that index has to be subset as well, and this is what dplyr does not do: filter() drops rows from the data but leaves the index attribute at its original length.

You can see that by:

nrow(df_dplyr) # 70
nrow(index(df_dplyr)) # 100

nrow(df_base) # 70
nrow(index(df_base)) # 70

To repair the out-of-sync index, rebuild the pdata.frame from a plain data frame:

df_dplyr_fixed <- pdata.frame(as.data.frame(df_dplyr), c("id", "t"))
print(summary(plm(A ~ B + C,data = df_dplyr_fixed)))
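
An alternative that avoids the repair step is to do the dplyr work on a plain data.frame and convert to a pdata.frame only at the end, so the index is built from the rows that actually remain. The following is a minimal sketch of that ordering; the name df_late is hypothetical, and the toy data is rebuilt with rep() rather than copied from the question, so the estimates are not meant to match the output above.

```r
library(dplyr)
library(plm)

set.seed(1)

# Toy panel: 10 individuals (id 0..9) observed over 10 periods (t 1..10)
df <- data.frame(A = runif(100), B = runif(100), C = runif(100))
df$id <- rep(0:9, each = 10)
df$t <- rep(1:10, times = 10)

# Filter first on the plain data.frame, then create the pdata.frame,
# so the index attribute is generated from the 70 remaining rows
df_late <- df %>%
  filter(t < 8) %>%
  pdata.frame(index = c("id", "t"))

# Data and index now have the same number of rows
nrow(df_late)
nrow(index(df_late))

print(summary(plm(A ~ B + C, data = df_late, model = "within")))
```

With this ordering the row count of the data and of index(df_late) should agree, which is the same check used above to spot the problem.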
Helix123 answered Sep 16 '25 23:09
