Here I make a new column to indicate whether myData is above or below its median:
### MedianSplits based on Whole Data
#create some test data
myDataFrame=data.frame(myData=runif(15),myFactor=rep(c("A","B","C"),5))
#create column showing median split
myBreaks= quantile(myDataFrame$myData,c(0,.5,1))
myDataFrame$MedianSplitWholeData = cut(
    myDataFrame$myData,
    breaks=myBreaks,
    include.lowest=TRUE,
    labels=c("Below","Above"))
#Check if it's correct
myDataFrame$AboveWholeMedian = myDataFrame$myData > median(myDataFrame$myData)
myDataFrame
Works fine. Now I want to do the same thing, but compute the median splits within each level of myFactor.
I've come up with this:
#Median splits within factor levels
byOutput=by(myDataFrame$myData,myDataFrame$myFactor, function (x) {
    myBreaks= quantile(x,c(0,.5,1))
    MedianSplitByGroup=cut(x,
        breaks=myBreaks,
        include.lowest=TRUE,
        labels=c("Below","Above"))
    MedianSplitByGroup
})
byOutput contains what I want: it categorizes each element within factor levels A, B, and C correctly. However, I'd like to create a new column, myDataFrame$FactorLevelMedianSplit, that holds the newly computed median split.
How do you convert the output of the "by" command into a useful data-frame column?
I suspect the "by" command is not the R-like way to do this ...
Update:
With Thierry's example of how to use factor() cleverly, and upon discovering the "ave" function in Spector's book, I've found this solution, which requires no additional packages.
myDataFrame$MediansByFactor=ave(
    myDataFrame$myData,
    myDataFrame$myFactor,
    FUN=median)
myDataFrame$FactorLevelMedianSplit = factor(
    myDataFrame$myData>myDataFrame$MediansByFactor,
    levels = c(TRUE, FALSE),
    labels = c("Above", "Below"))
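The reason ave fits here is that it returns a vector the same length as its input, with each group's summary repeated in the original row positions, so the result lines up with the data frame without any re-sorting. A small sketch with made-up numbers (the vectors values and groups are illustrative and not part of the question):

```r
# Illustrative data: two interleaved groups
values <- c(1, 10, 2, 20, 3, 30)
groups <- c("a", "b", "a", "b", "a", "b")

# ave() computes the median within each group and returns it
# in the original row order, one value per input element
groupMedians <- ave(values, groups, FUN = median)
groupMedians
# [1]  2 20  2 20  2 20

# Comparing each value against its own group's median gives the split
values > groupMedians
# [1] FALSE FALSE FALSE FALSE  TRUE  TRUE
```

Note that with the strict > comparison a value exactly equal to its group median is labelled "Below", which matches what cut() does with include.lowest=TRUE in the code above.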
Here is a solution using the plyr package.
myDataFrame <- data.frame(myData=runif(15),myFactor=rep(c("A","B","C"),5))
library(plyr)
ddply(myDataFrame, "myFactor", function(x){
    x$Median <- median(x$myData)
    x$FactorLevelMedianSplit <- factor(x$myData <= x$Median, levels = c(TRUE, FALSE), labels = c("Below", "Above"))
    x
})
Here is a hack-ish way. Hadley may come up with something more elegant.
To start, we simply concatenate the by output:
R> do.call(c,byOutput)
A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 C1 C2 C3 C4 C5
1 2 2 1 1 1 1 2 1 2 1 2 1 1 2
What matters is that we get the integer codes 1 and 2 of the factor levels here, which we can use to index a vector of labels and build a new factor from the result:
R> c("Below","Above")[do.call(c,byOutput)]
[1] "Below" "Above" "Above" "Below" "Below" "Below" "Below" "Above"
[9] "Below" "Above" "Below" "Above" "Below" "Below" "Above"
R> as.factor(c("Below","Above")[do.call(c,byOutput)])
[1] Below Above Above Below Below Below Below Above Below Above
[11] Below Above Below Below Above
Levels: Above Below
which we can then assign into the data.frame you wanted to modify:
R> myDataFrame$FactorLevelMedianSplit <-
as.factor(c("Below","Above")[do.call(c,byOutput)])
Update: Never mind, we'd need to reindex myDataFrame to be sorted A A ... A B ... B C ... C as well before we add the new column. Left as an exercise...
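One possible way to do that exercise without sorting anything, offered as a sketch rather than what this answer had in mind: split the data by group, compute the median split on each piece, and let unsplit() put the pieces back into the original row order. The helper name medianSplit and the set.seed call are illustrative additions, not part of the question.

```r
set.seed(1)  # only to make the illustration reproducible
myDataFrame <- data.frame(myData = runif(15),
                          myFactor = rep(c("A", "B", "C"), 5))

# Hypothetical helper: median split of one numeric vector
medianSplit <- function(x) {
  cut(x,
      breaks = quantile(x, c(0, .5, 1)),
      include.lowest = TRUE,
      labels = c("Below", "Above"))
}

# Split by group, apply the helper, then restore the original row order
pieces <- lapply(split(myDataFrame$myData, myDataFrame$myFactor), medianSplit)
myDataFrame$FactorLevelMedianSplit <- unsplit(pieces, myDataFrame$myFactor)

# Cross-check against the per-group medians computed with ave()
groupMedians <- ave(myDataFrame$myData, myDataFrame$myFactor, FUN = median)
splitAgrees <- all((myDataFrame$FactorLevelMedianSplit == "Above") ==
                     (myDataFrame$myData > groupMedians))
splitAgrees
```

Because unsplit() uses the same grouping vector that split() used, each label lands back on the row it came from, so the data frame never has to be reordered to A A ... B B ... C C.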