I have a question that will hopefully not be a huge obstacle for advanced R users...
test.data <- data.frame(case = c(1, 1, 1, 2, 2, 2, 3),
year = c(2006, 2007, 2008, 2007, 2006, 2008, 2006),
level = c(10, 20, 20, 12, 20, 20, 20))
As you might be able to see, I have multiple occurrences for each case, distinguished by the year. The values of level differ within a case, and I would like to correct that by setting each value of level to the minimum level of a given case. In this example, each value of level for case=1 should be 10, and each value of level for case=2 should be 12. For any specific case I could do the following:
test.data$level[test.data$case==1] <- min(test.data$level[test.data$case==1])
But since I have several hundred cases, this would take quite long. Therefore, I would like to ask if you have a quicker solution.
You can try
library(data.table)
setDT(test.data)[, level:= min(level, na.rm=TRUE), case]
# case year level
#1: 1 2006 10
#2: 1 2007 10
#3: 1 2008 10
#4: 2 2007 12
#5: 2 2006 12
#6: 2 2008 12
#7: 3 2006 20
Or using dplyr
library(dplyr)
test.data %>%
group_by(case) %>%
mutate(level= min(level, na.rm=TRUE))
# case year level
#1 1 2006 10
#2 1 2007 10
#3 1 2008 10
#4 2 2007 12
#5 2 2006 12
#6 2 2008 12
#7 3 2006 20
Or using sqldf/dplyr
library(sqldf)
library(dplyr)
sqldf('select * from "test.data"
left join(select "case",
min(level) as Level
from "test.data"
group by "case")
using ("case")') %>%
select(-level)
# case year Level
#1 1 2006 10
#2 1 2007 10
#3 1 2008 10
#4 2 2007 12
#5 2 2006 12
#6 2 2008 12
#7 3 2006 20
Or a modification suggested by @G.Grothendieck using only sqldf
sqldf('select "case", year, "min(level)" as Level
from "test.data"
left join(select "case", min(level)
from "test.data"
group by "case")
using ("case")')
#1 1 2006 10
#2 1 2007 10
#3 1 2008 10
#4 2 2007 12
#5 2 2006 12
#6 2 2008 12
#7 3 2006 20
Or using base R
test.data$level <- with(test.data, ave(level, case, FUN=min))
Here's a classic using base R functions.
# may not be optimal for larger datasets due to merge
min.lvl <- aggregate(level ~ case, data = test.data, FUN = min)
merge(x = test.data, y = min.lvl, by = "case", all.x = TRUE, sort = FALSE)
case year level.x level.y
1 1 2006 10 10
2 1 2007 20 10
3 1 2008 20 10
4 2 2007 12 12
5 2 2006 20 12
6 2 2008 20 12
7 3 2006 20 20
Second vanilla option of doing things would be
new.data <- by(data = test.data, INDICES = test.data$case, FUN = function(x) {
x$level <- min(x$level)
x
})
do.call("rbind", new.data)
case year level
1.1 1 2006 10
1.2 1 2007 10
1.3 1 2008 10
2.4 2 2007 12
2.5 2 2006 12
2.6 2 2008 12
3 3 2006 20
Alternative using doBy
library(doBy)
summaryBy(level ~ case, id=~ year, test.data,
full.dimension=TRUE, keep.names=TRUE, min)
# case level year
#1: 1 10 2006
#2: 1 10 2006
#3: 1 10 2006
#4: 2 12 2007
#5: 2 12 2007
#6: 2 12 2007
#7: 3 20 2006
Or to be more compact
library(plyr)
ddply(test.data, .(case), mutate, level = min(level))
# case year level
#1 1 2006 10
#2 1 2007 10
#3 1 2008 10
#4 2 2007 12
#5 2 2006 12
#6 2 2008 12
#7 3 2006 20
Another base R method using lapply
do.call(rbind,lapply(split(test.data, test.data$case),
function(x){x$level = min(x$level); x}))
# case year level
#1: 1 2006 10
#2: 1 2007 10
#3: 1 2008 10
#4: 2 2007 12
#5: 2 2006 12
#6: 2 2008 12
#7: 3 2006 20
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With