
Cut function in R - exclusive or am I double counting?

Tags:

r

Based off of a previous question I asked, which @Andrie answered, I have a question about the usage of the cut function and labels.

I'd like to get summary statistics based on the range of the number of times a user logs in.

Here is my data:

  # Get random numbers
  NumLogin <- round(runif(100,1,50))

  # Set the login range     
  LoginRange <- cut(NumLogin, 
       c(0,1,3,5,10,15,20,Inf), 
       labels=c('1','2','3-5','6-10','11-15','16-20','20+')
       )

Now I have my LoginRange, but I'm unsure how the cut function actually works. I want to find users who have logged in 1 time, 2 times, 3-5 times, etc., while only including a user if they are in that range. Is the cut function including 3 twice (in the 2 bucket and the 3-5 bucket)? If I look at my example, I can see a user who logged in 3 times, but they are cut as '2'. I've looked at the documentation and every R book I own, but no luck. What am I doing wrong?

Also - As a usage question - should I attach the LoginRange to my data frame? If so, what's the best way to do so?

DF <- data.frame(NumLogin, LoginRange)

?

Thanks

mikebmassey Avatar asked Nov 22 '11 21:11


1 Answer

The intervals defined by the cut() function are (by default) closed on the right. To see what that means, try this:

cut(1:2, breaks=c(0,1,2))
# [1] (0,1] (1,2]

As you can see, the integer 1 gets included in the range (0,1], not in the range (1,2]. It doesn't get double-counted, and for any input value falling outside of the bins you define, cut() will return a value of NA.
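To make the right-closed behaviour concrete, here is a minimal sketch. The vector `x` and the breaks are illustrative inputs chosen for this example, not taken from the question. It shows which bin each integer lands in, that values outside the breaks become NA, and how `right = FALSE` flips the closed side:

```r
x <- 0:4  # illustrative input, not from the question

# Default: intervals are closed on the right, i.e. (0,1] and (1,3]
right_closed <- cut(x, breaks = c(0, 1, 3))
as.character(right_closed)
# 0 -> NA, 1 -> "(0,1]", 2 -> "(1,3]", 3 -> "(1,3]", 4 -> NA

# right = FALSE: intervals are closed on the left, i.e. [0,1) and [1,3)
left_closed <- cut(x, breaks = c(0, 1, 3), right = FALSE)
as.character(left_closed)
# 0 -> "[0,1)", 1 -> "[1,3)", 2 -> "[1,3)", 3 -> NA, 4 -> NA
```

Either way, every value is assigned to at most one bin; only the side on which a boundary value falls changes.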

When dealing with integer-valued data, I tend to set break points between the integers, just to avoid tripping myself up. In fact, doing this with your data (as shown below) reveals that the 2nd and 3rd bins were actually incorrectly named, which illustrates the point quite nicely!

LoginRange <- cut(NumLogin, 
   c(0.5, 1.5, 3.5, 5.5, 10.5, 15.5, 20.5, Inf),
   # c(0,1,3,5,10,15,20,Inf) + 0.5, 
   labels=c('1','2-3','4-5','6-10','11-15','16-20','20+')
   )
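If it helps, the corrected labels can be checked against the data, and the side question about the data frame sketched at the same time. This is only a sketch under a few assumptions: the `set.seed(42)` call and the names `labs`, `half_breaks` and `int_breaks` are introduced here for illustration, and the mean is just one possible summary statistic.

```r
set.seed(42)  # arbitrary seed (an assumption), only to make the sketch reproducible
NumLogin <- round(runif(100, 1, 50))

labs <- c('1', '2-3', '4-5', '6-10', '11-15', '16-20', '20+')

# Half-integer breaks, as above
half_breaks <- cut(NumLogin, c(0.5, 1.5, 3.5, 5.5, 10.5, 15.5, 20.5, Inf),
                   labels = labs)

# The original integer breaks give the same bins for integer-valued data,
# because each interval is closed on the right
int_breaks <- cut(NumLogin, c(0, 1, 3, 5, 10, 15, 20, Inf), labels = labs)
identical(half_breaks, int_breaks)

# Each user is counted exactly once: the bucket counts sum to the number of users
LoginRange <- half_breaks
sum(table(LoginRange)) == length(NumLogin)

# Smallest and largest NumLogin per bucket (NULL for a bucket with no users);
# note the last bucket, labelled '20+', starts at 21
tapply(NumLogin, LoginRange, range)

# Attaching the factor to a data frame and summarising per bucket
DF <- data.frame(NumLogin, LoginRange)
aggregate(NumLogin ~ LoginRange, data = DF, FUN = mean)
```

Building the data frame with `data.frame(NumLogin, LoginRange)`, as proposed in the question, works fine; `DF$LoginRange <- LoginRange` would be an equivalent way to add the column to an existing data frame.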
Josh O'Brien Avatar answered Oct 01 '22 10:10
