R sample from unbalanced panel data

Question

I am working with unbalanced panel data from which I would like to draw a random sample that is unbiased by the differing number of observations per unit. For example, in the code below, IBM is two times more likely to be selected than GOOG and five times more likely to be selected than MSFT. Is there any way to sample this data as if each company/year has an equal probability of being selected? Possibly by using the sampling package?

df <- data.frame(COMPANY=c(rep('IBM',50),rep('GOOG',25),rep('MSFT',10)), YEAR=c(1961:2010,1988:2012,1996:2005), PROFIT=rnorm(85))
df

df[sample(nrow(df), 20, replace=FALSE), ]

Julius Vainora · Accepted Answer

Here is what you could do:

probs <- 1 / table(df$COMPANY)[df$COMPANY]
df[sample(nrow(df), 20, replace = FALSE, prob = probs), ]

Let us test it:

table(df[sample(nrow(df), 1e6, replace = TRUE, prob = probs), "COMPANY"])
#   GOOG    IBM   MSFT 
# 333499 333080 333421
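
The check above draws with replacement. For the actual 20-row draw without replacement, `sample` applies the weights sequentially to the rows that remain, so the expected counts per company need not be exactly equal. A minimal simulation sketch to see how close they get, assuming the `df` from the question (rebuilt here so the block is self-contained; `n_rep`, `counts` and `avg_counts` are names introduced only for illustration):

```r
set.seed(123)

# Rebuild the example data from the question
df <- data.frame(
  COMPANY = c(rep("IBM", 50), rep("GOOG", 25), rep("MSFT", 10)),
  YEAR    = c(1961:2010, 1988:2012, 1996:2005),
  PROFIT  = rnorm(85)
)

# A factor keeps all three companies in every tabulation, even with zero draws
company <- factor(df$COMPANY)

# Inverse-count weights, looked up by company name
probs <- as.vector(1 / table(company)[as.character(company)])

# Repeat the 20-row draw without replacement and count rows per company
n_rep <- 2000
counts <- replicate(n_rep, {
  idx <- sample(nrow(df), 20, replace = FALSE, prob = probs)
  as.vector(table(company[idx]))
})
rownames(counts) <- levels(company)

# Average number of rows drawn per company across the repetitions
avg_counts <- rowMeans(counts)
print(avg_counts)
```

If the averages differ noticeably between companies, that reflects the sequential weighting rather than a mistake in `probs`.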

Instead of giving every row the same probability, 1/(50+25+10), each row is weighted by the inverse of its company's row count, so that the weights of every company sum to the same total and each company is equally likely to be chosen:

tapply(probs, df$COMPANY, sum)
# GOOG  IBM MSFT 
#   1    1    1

probs sums to 3 rather than 1, but sample rescales the prob vector internally, so that is not a problem. To make the math clearer, take a simple example (whose weights again do not sum to 1):

vec <- c(1, 1, 2)
as.vector(1 / table(vec)[vec])
# [1] 0.5 0.5 1.0
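
Note that `table(vec)[vec]` indexes the table by position when `vec` is numeric, which works here only because the values 1 and 2 coincide with the positions of their counts. A sketch of the same inverse-count weights that does not rely on that coincidence, using `ave()` from base R (the vector `vec2` and the names `n_in_group`, `w`, `p` and `group_share` are made up for illustration):

```r
# Group labels whose values do not match their positions in the table
vec2 <- c(10, 10, 20)

# Size of the group each element belongs to: 2, 2, 1
n_in_group <- ave(seq_along(vec2), vec2, FUN = length)

# Inverse-count weights, as in the answer
w <- 1 / n_in_group

# The rescaling to a total of 1, done by hand here
p <- w / sum(w)

# Each group ends up with the same total probability
group_share <- tapply(p, vec2, sum)

print(w)
print(group_share)
```

For character or factor labels such as `df$COMPANY`, the lookup by name in the original code already behaves as intended.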

overeducatedpoverty · Answer

I'm a new R user, but here is my solution:

Load the example data (based on the PSID). It is an unbalanced panel: 98 observations in 15 groups (id), covering 1977 to 1983, with a gender variable that is not used here.

df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 5L, 5L, 5L, 5L, 5L,5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 10L,10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L, 11L, 11L, 11L,12L, 12L, 12L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 13L, 13L,13L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 15L, 15L, 15L, 15L, 15L,15L, 15L), year = c(1978L, 1979L, 1980L, 1981L, 1982L, 1983L,1977L, 1978L, 1979L, 1980L, 1981L, 1982L, 1983L, 1977L, 1978L,1979L, 1980L, 1981L, 1982L, 1983L, 1979L, 1977L, 1978L, 1979L,1980L, 1981L, 1982L, 1983L, 1977L, 1978L, 1979L, 1980L, 1981L,1982L, 1983L, 1977L, 1978L, 1979L, 1980L, 1981L, 1982L, 1983L,1977L, 1978L, 1979L, 1980L, 1981L, 1982L, 1983L, 1977L, 1978L,1979L, 1980L, 1981L, 1982L, 1983L, 1977L, 1978L, 1979L, 1980L,1981L, 1982L, 1983L, 1977L, 1978L, 1979L, 1980L, 1981L, 1982L,1983L, 1977L, 1978L, 1979L, 1980L, 1981L, 1982L, 1983L, 1977L,1978L, 1979L, 1980L, 1981L, 1982L, 1983L, 1977L, 1978L, 1979L,1980L, 1981L, 1982L, 1983L, 1977L, 1978L, 1979L, 1980L, 1981L,1982L, 1983L), gender = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L,1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("id", "year","gender"), row.names = c(NA, 98L), class = "data.frame")

Create a data frame with one row per group id (in this example there are 15 distinct groups). This uses the dplyr package:

library(dplyr)
ids <- select(df, id) %>% group_by(id) %>% sample_n(1)

Draw a random sample of 5 of the 15 groups. The ids are kept as drawn; renumbering them would make the merge below always return groups 1 to 5:

ids <- ungroup(ids) %>% sample_n(5)

Merge the original data frame (many rows per id) with the sampled ids (one row per id), keeping every observation of the sampled groups:

df_new <- merge(x = df, y = ids, by = "id", all.y = TRUE)
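
The steps above amount to sampling whole groups and keeping all of their rows, which can also be sketched in base R without a merge. This is an alternative formulation, not the answerer's code; the small panel and the names `sizes`, `panel`, `picked_ids` and `panel_new` are hypothetical:

```r
set.seed(1)

# Hypothetical unbalanced panel: 6 groups with differing numbers of years
sizes <- c(6, 7, 7, 1, 7, 7)
panel <- data.frame(
  id   = rep(seq_along(sizes), times = sizes),
  year = 1976L + sequence(sizes)
)

# Draw 3 of the 6 group ids, each group equally likely regardless of its size
picked_ids <- sample(unique(panel$id), 3)

# Keep every observation that belongs to a sampled group
panel_new <- panel[panel$id %in% picked_ids, ]

print(sort(picked_ids))
print(table(panel_new$id))
```

Because the draw is made over distinct ids rather than rows, a group with one observation is as likely to be picked as a group with seven.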

Tags: random, r, panel-data

user1491868
