I'm using the cut function to split my data into groups using the max/min range. Here is an example of the code that I am using:
# sample data frame - used to identify initial groups
testdf <- data.frame(a = c(1:100), b = rnorm(100))
# split into groups based on ranges
k <- 20 # number of groups
# split into groups, keep code
testdf$groupCode <- cut(testdf$b, breaks = k, labels = FALSE)
# store factor information
testdf$group <- cut(testdf$b, breaks = k)
head(testdf)
I want to use the factor groupings identified to split another data frame up, but I'm not sure how to use factors to deal with this. I think my code structure should be roughly as follows:
# this is the data I want to categorize based on previous groupings
datadf <- data.frame(a = c(1:100), b = rnorm(100))
datadf$groupCode <- function(x){return(groupCode)}
I see that the factor data is structured as follows, but I don't know how to use it properly:
testdf$group[0]
factor(0)
20 Levels: (-2.15,-1.91] (-1.91,-1.67] (-1.67,-1.44] (-1.44,-1.2] ... (2.34,2.58]
Two functions that I have been experimenting with (but which do not work) are as follows:
# get group code
nearestCode <- function( number, groups ){
return( which( abs( groups-number )== min( abs(groups-number) ) ) )
}
nearestCode(7, testdf$group[0])
I have also been experimenting with the which function:
which(7, testdf$group[0])
What is the best way of identifying groupings and applying them to another dataframe?
I would have used:
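# compute the breaks once so they can be reused on other data
grpbrks <- quantile(testdf$b, seq(0, 1, by = 0.05), na.rm = TRUE)
# include.lowest = TRUE keeps the minimum (which equals the first break) from becoming NA
testdf$groupCode <- cut(testdf$b, breaks = grpbrks, include.lowest = TRUE)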
Then you can use:
findInterval(newdat$newvar, grpbrks) # to group new data
And you then won't need to screw around with recovering the breaks from the labels or the data.
Thinking about it, I guess you could also use:
cut(newdat$newvar, grpbrks) # more isomorphic to original categorization I suppose
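To make the idea above concrete, here is a minimal sketch that computes the breaks once and reuses them on a second data frame. The names datadf and openbrks are only for illustration, and widening the outer breaks to -Inf/Inf is my own assumption so that new values outside the original range still land in a group; leave the breaks as they are if you would rather get NA for those.

```r
set.seed(42)
testdf <- data.frame(a = 1:100, b = rnorm(100))  # reference data
datadf <- data.frame(a = 1:100, b = rnorm(100))  # new data to categorize

# Breaks computed once from the reference data (20 quantile groups)
grpbrks <- quantile(testdf$b, seq(0, 1, by = 0.05), na.rm = TRUE)

# Assumption: open up the outer breaks so out-of-range values are not NA
openbrks <- unname(grpbrks)
openbrks[1] <- -Inf
openbrks[length(openbrks)] <- Inf

# Same breaks applied to both data frames; labels = FALSE returns integer codes
testdf$groupCode <- cut(testdf$b, breaks = openbrks, labels = FALSE)
datadf$groupCode <- cut(datadf$b, breaks = openbrks, labels = FALSE)

# findInterval() is left-closed by default; left.open = TRUE mimics cut()'s (a, b]
datadf$groupCode2 <- findInterval(datadf$b, openbrks, left.open = TRUE)

head(datadf)
```

Keeping the numeric breaks around is the whole point: the factor labels are only a rounded, printed form of them.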
Screwing around with some regular expressions seems to be the only way of actually recovering the break values from an object returned by cut.
The following code does the necessary screwing:
cut_breaks <- function(x){
first <- as.numeric(gsub(".{1}(.+),.*", "\\1", levels(x))[1])
other <- as.numeric(gsub(".+,(.*).{1}", "\\1", levels(x)))
c(first, other)
}
set.seed(1)
x <- rnorm(100)
cut1 <- cut(x, breaks=20)
cut_breaks(cut1)
[1] -2.2200 -1.9900 -1.7600 -1.5300 -1.2900 -1.0600 -0.8320 -0.6000 -0.3690
[10] -0.1380 0.0935 0.3250 0.5560 0.7870 1.0200 1.2500 1.4800 1.7100
[19] 1.9400 2.1700 2.4100
levels(cut1)
[1] "(-2.22,-1.99]" "(-1.99,-1.76]" "(-1.76,-1.53]" "(-1.53,-1.29]"
[5] "(-1.29,-1.06]" "(-1.06,-0.832]" "(-0.832,-0.6]" "(-0.6,-0.369]"
[9] "(-0.369,-0.138]" "(-0.138,0.0935]" "(0.0935,0.325]" "(0.325,0.556]"
[13] "(0.556,0.787]" "(0.787,1.02]" "(1.02,1.25]" "(1.25,1.48]"
[17] "(1.48,1.71]" "(1.71,1.94]" "(1.94,2.17]" "(2.17,2.41]"
You can then pass these break values to cut using the breaks= parameter to make your second cut.
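A rough sketch of that second step, assuming a hypothetical second vector newx. One caveat: the labels only show rounded breaks (3 significant digits by default), so the recovered values are approximate. Raising dig.lab in the first cut, as done here, is my own addition to reduce that rounding; it is not required.

```r
# Helper copied from above: parse the break values out of the cut() labels
cut_breaks <- function(x){
  first <- as.numeric(gsub(".{1}(.+),.*", "\\1", levels(x))[1])
  other <- as.numeric(gsub(".+,(.*).{1}", "\\1", levels(x)))
  c(first, other)
}

set.seed(1)
x <- rnorm(100)
# dig.lab = 10 prints more digits in the labels, so less is lost to rounding
cut1 <- cut(x, breaks = 20, dig.lab = 10)
brks <- cut_breaks(cut1)

# Hypothetical new data, categorized with the recovered breaks
newx <- rnorm(100)
newcode <- cut(newx, breaks = brks, labels = FALSE)

# Values of newx outside the original range come back as NA
table(newcode, useNA = "ifany")
```

If you control the first cut anyway, storing the breaks up front is less fragile than parsing them back out of the labels.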