Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sample to have an equal number of each sex within groups in R

Tags:

r

statistics

First things, first. Here are my data:

lat <- c(12, 12, 58, 58, 58, 58, 58, 45, 45, 45, 45, 45, 45, 64, 64, 64, 64, 64, 64, 64)
long <- c(-14, -14, 139, 139, 139, 139, 139, -68, -68, -68, -68, -68, 1, 1, 1, 1, 1, 1, 1, 1)
sex <- c("M", "M", "M", "M", "F", "M", "M", "F", "M", "M", "M", "F", "M", "F", "M", "F", "F", "F", "F", "M")
score <- c(2, 6, 3, 6, 5, 4, 3, 2, 3, 9, 9, 8, 6, 5, 6, 7, 5, 7, 5, 1)

data <- data.frame(lat, long, sex, score)

The data should look like this:

   lat long sex score
1   12  -14   M     2
2   12  -14   M     6
3   58  139   M     3
4   58  139   M     6
5   58  139   F     5
6   58  139   M     4
7   58  139   M     3
8   45  -68   F     2
9   45  -68   M     3
10  45  -68   M     9
11  45  -68   M     9
12  45  -68   F     8
13  45    1   M     6
14  64    1   F     5
15  64    1   M     6
16  64    1   F     7
17  64    1   F     5
18  64    1   F     7
19  64    1   F     5
20  64    1   M     1

I am at my wits end trying to figure this one out. The variables are latitude, longitude, sex and score. I would like to have an equal number of males and females within each location (i.e. with the same longitude and latitude). For instance, the second location (rows 3 to 7) has only one female. This female should be retained and one male from the remaining individuals should also be retained (by random sampling, perhaps). Some locations have only information about one sex, e.g. the first location (rows 1 and 2) has only data on males. The rows from this location should be dropped (since there are no females). All going according to plan the final dataset should look something like this:

   lat2 long2 sex2 score2
1    58   139    F      5
2    58   139    M      4
3    45   -68    F      2
4    45   -68    M      3
5    45   -68    M      9
6    45   -68    F      8
7    64     1    M      6
8    64     1    F      5
9    64     1    F      7
10   64     1    M      1

Any help would be appreciated.

like image 695
Ciaran Avatar asked Jan 20 '26 11:01

Ciaran


2 Answers

Here's a solution with lapply:

data[unlist(lapply(with(data, split(seq.int(nrow(data)), paste(lat, long))),
        # 'split' splits the sequence of row numbers (indices) along the unique
        # combinations of 'lat' and 'long'
        # 'lapply' applies the following function to all sub-sequences
        function(x) {
          # which of the indices are for males:
          male <- which(data[x, "sex"] == "M")
          # which of the indices are for females:
          female <- which(data[x, "sex"] == "F")
          # sample from the indices of males:
          s_male <- sample(male, min(length(male), length(female)))
          # sample from the indices of females:
          s_female <- sample(female, min(length(male), length(female)))
          # combine both sampled indices:
          x[c(s_male, s_female)]                
        })), ]
# The function 'lappy' returns a list of indices which is transformed to a vector
# using 'unlist'. These indices are used to subset the original data frame.

The result:

   lat long sex score
9   45  -68   M     3
11  45  -68   M     9
12  45  -68   F     8
8   45  -68   F     2
7   58  139   M     3
5   58  139   F     5
20  64    1   M     1
15  64    1   M     6
19  64    1   F     5
16  64    1   F     7
like image 99
Sven Hohenstein Avatar answered Jan 23 '26 01:01

Sven Hohenstein


Below is a quick way to go about it, which involves creating a temporary column of the lat-long combination. We split the DF according to this column, count the M/F in each split, sample appropriately, then re-combine.

# First, We call the dataframe something other than "data" ;) 
mydf <- data.frame(lat, long, sex, score)

# create a new data frame with a temporary column, which concatenates the lat & long. 
mydf.new <- data.frame(mydf, latlong=paste(mydf$lat, mydf$long, sep=","))

# Split the data frame according to the lat-long location
mydf.splat <- split(mydf.new, mydf.new$latlong)

# eg, taking a look at one of our tables:
mydf.splat[[4]]

sampled <- 
  lapply(mydf.splat, function(tabl) {
    Ms <- sum(tabl$sex=="M")
    Fs <- sum(tabl$sex=="F")

    if(Fs == 0 || Ms ==0)    # If either is zero, we drop that location
      return(NULL)

    if(Fs == Ms)   # If they are both equal, no need to sample. 
      return(tabl)

    # If number of Females less than Males, return all Females 
    #    and sample from males in ammount equal to Females
    if (Fs < Ms)   
      return(tabl[c(which(tabl$sex=="F"), sample(which(tabl$sex=="M"), Fs)),   ])

    if (Ms < Fs)  # same as previous, but for Males < Femals
      return(tabl[c(which(tabl$sex=="M"), sample(which(tabl$sex=="F"), Ms)),   ])

    stop("hmmm... something went wrong.")  ## We should never hit this line, but just in case. 
  })

# Flatten into a single table
mydf.new <- do.call(rbind, sampled)

# Clean up
row.names(mydf.new) <- NULL  # remove the row names that were added
mydf.new$latlong <- NULL     # remove the temporary column that we added

RESULTS

mydf.new

#    lat long sex score
# 1   45  -68   F     2
# 2   45  -68   F     8
# 3   45  -68   M     9
# 4   45  -68   M     3
# 5   58  139   F     5
# 6   58  139   M     3
# 7   64    1   M     6
# 8   64    1   M     1
# 9   64    1   F     7
# 10  64    1   F     5
like image 40
Ricardo Saporta Avatar answered Jan 23 '26 02:01

Ricardo Saporta



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!