First things, first. Here are my data:
lat <- c(12, 12, 58, 58, 58, 58, 58, 45, 45, 45, 45, 45, 45, 64, 64, 64, 64, 64, 64, 64)
long <- c(-14, -14, 139, 139, 139, 139, 139, -68, -68, -68, -68, -68, 1, 1, 1, 1, 1, 1, 1, 1)
sex <- c("M", "M", "M", "M", "F", "M", "M", "F", "M", "M", "M", "F", "M", "F", "M", "F", "F", "F", "F", "M")
score <- c(2, 6, 3, 6, 5, 4, 3, 2, 3, 9, 9, 8, 6, 5, 6, 7, 5, 7, 5, 1)
data <- data.frame(lat, long, sex, score)
The data should look like this:
lat long sex score
1 12 -14 M 2
2 12 -14 M 6
3 58 139 M 3
4 58 139 M 6
5 58 139 F 5
6 58 139 M 4
7 58 139 M 3
8 45 -68 F 2
9 45 -68 M 3
10 45 -68 M 9
11 45 -68 M 9
12 45 -68 F 8
13 45 1 M 6
14 64 1 F 5
15 64 1 M 6
16 64 1 F 7
17 64 1 F 5
18 64 1 F 7
19 64 1 F 5
20 64 1 M 1
I am at my wits end trying to figure this one out. The variables are latitude, longitude, sex and score. I would like to have an equal number of males and females within each location (i.e. with the same longitude and latitude). For instance, the second location (rows 3 to 7) has only one female. This female should be retained and one male from the remaining individuals should also be retained (by random sampling, perhaps). Some locations have only information about one sex, e.g. the first location (rows 1 and 2) has only data on males. The rows from this location should be dropped (since there are no females). All going according to plan the final dataset should look something like this:
lat2 long2 sex2 score2
1 58 139 F 5
2 58 139 M 4
3 45 -68 F 2
4 45 -68 M 3
5 45 -68 M 9
6 45 -68 F 8
7 64 1 M 6
8 64 1 F 5
9 64 1 F 7
10 64 1 M 1
Any help would be appreciated.
Here's a solution with lapply:
data[unlist(lapply(with(data, split(seq.int(nrow(data)), paste(lat, long))),
# 'split' splits the sequence of row numbers (indices) along the unique
# combinations of 'lat' and 'long'
# 'lapply' applies the following function to all sub-sequences
function(x) {
# which of the indices are for males:
male <- which(data[x, "sex"] == "M")
# which of the indices are for females:
female <- which(data[x, "sex"] == "F")
# sample from the indices of males:
s_male <- sample(male, min(length(male), length(female)))
# sample from the indices of females:
s_female <- sample(female, min(length(male), length(female)))
# combine both sampled indices:
x[c(s_male, s_female)]
})), ]
# The function 'lappy' returns a list of indices which is transformed to a vector
# using 'unlist'. These indices are used to subset the original data frame.
The result:
lat long sex score
9 45 -68 M 3
11 45 -68 M 9
12 45 -68 F 8
8 45 -68 F 2
7 58 139 M 3
5 58 139 F 5
20 64 1 M 1
15 64 1 M 6
19 64 1 F 5
16 64 1 F 7
Below is a quick way to go about it, which involves creating a temporary column of the lat-long combination. We split the DF according to this column, count the M/F in each split, sample appropriately, then re-combine.
# First, We call the dataframe something other than "data" ;)
mydf <- data.frame(lat, long, sex, score)
# create a new data frame with a temporary column, which concatenates the lat & long.
mydf.new <- data.frame(mydf, latlong=paste(mydf$lat, mydf$long, sep=","))
# Split the data frame according to the lat-long location
mydf.splat <- split(mydf.new, mydf.new$latlong)
# eg, taking a look at one of our tables:
mydf.splat[[4]]
sampled <-
lapply(mydf.splat, function(tabl) {
Ms <- sum(tabl$sex=="M")
Fs <- sum(tabl$sex=="F")
if(Fs == 0 || Ms ==0) # If either is zero, we drop that location
return(NULL)
if(Fs == Ms) # If they are both equal, no need to sample.
return(tabl)
# If number of Females less than Males, return all Females
# and sample from males in ammount equal to Females
if (Fs < Ms)
return(tabl[c(which(tabl$sex=="F"), sample(which(tabl$sex=="M"), Fs)), ])
if (Ms < Fs) # same as previous, but for Males < Femals
return(tabl[c(which(tabl$sex=="M"), sample(which(tabl$sex=="F"), Ms)), ])
stop("hmmm... something went wrong.") ## We should never hit this line, but just in case.
})
# Flatten into a single table
mydf.new <- do.call(rbind, sampled)
# Clean up
row.names(mydf.new) <- NULL # remove the row names that were added
mydf.new$latlong <- NULL # remove the temporary column that we added
mydf.new
# lat long sex score
# 1 45 -68 F 2
# 2 45 -68 F 8
# 3 45 -68 M 9
# 4 45 -68 M 3
# 5 58 139 F 5
# 6 58 139 M 3
# 7 64 1 M 6
# 8 64 1 M 1
# 9 64 1 F 7
# 10 64 1 F 5
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With