
Grouping data into ranges in R

Tags: r, grouping

Suppose I have a data frame in R that has names of students in one column and their marks in another column. These marks range from 20 to 100.

> mydata  
id  name   marks gender  
1   a1    56     female  
2   a2    37      male  

I want to divide the students into groups based on the marks they obtained, so that the difference between marks within each group is no more than 10. I tried the table function, which gives the number of students in each range (say 20-30, 30-40), but I want to pick the students whose marks fall in a given range and put all their information together in a group. Any help is appreciated.

Maddy asked Sep 07 '12 09:09


2 Answers

I am not sure what you mean by "put all their information together in a group", but here is a way to split your original data frame into a list of data frames, where each element holds the students whose marks fall within one range of 10:

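mydata <- data.frame(
  id = 1:100,
  name = paste0("a", 1:100),
  marks = sample(20:100, 100, TRUE),
  gender = sample(c("female", "male"), 100, TRUE))

# include.lowest = TRUE keeps a mark of exactly 20 in the first group;
# without it, cut() returns NA for 20 and split() drops that row
split(mydata, cut(mydata$marks, seq(20, 100, by = 10), include.lowest = TRUE))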
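
As a follow-up sketch (not part of the original answer), the list returned by split() can be indexed by the interval labels that cut() generates, which is one way to look at a single group. The small data frame and the name `groups` below are made up for illustration.

```r
# Hypothetical toy data, chosen so the result is easy to check by hand
mydata <- data.frame(
  id = 1:6,
  name = paste0("a", 1:6),
  marks = c(20, 37, 56, 58, 91, 100),
  gender = c("female", "male", "female", "male", "female", "male"))

# include.lowest = TRUE keeps a mark of exactly 20 in the first interval
groups <- split(mydata, cut(mydata$marks, seq(20, 100, by = 10),
                            include.lowest = TRUE))

names(groups)         # interval labels such as "[20,30]" and "(50,60]"
groups[["(50,60]"]]   # all columns for the students with marks in (50, 60]
sapply(groups, nrow)  # number of students per interval, empty ones included
```

Intervals with no students still appear in the list as zero-row data frames, because split() keeps unused factor levels unless drop = TRUE is passed.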
Sacha Epskamp answered Oct 02 '22 11:10


I think that @Sacha's answer should suffice for what you need to do, even if you have more than one set.

You haven't explicitly said how you want to "group" the data in your original post, and in your comment, where you've added a second dataset, you haven't explicitly said whether you plan to "merge" these first (rbind would suffice, as recommended in the comment).

So, with that, here are several options, each with different levels of detail or utility in the output. Hopefully one of them suits your needs.

First, here's some sample data.

# Two data.frames (myData1, and myData2)
set.seed(1)
myData1 <- data.frame(id = 1:20, 
                      name = paste("a", 1:20, sep = ""),
                      marks = sample(20:100, 20, replace = TRUE),
                      gender = sample(c("F", "M"), 20, replace = TRUE))
myData2 <- data.frame(id = 1:17,
                      name = paste("b", 1:17, sep = ""),
                      marks = sample(30:100, 17, replace = TRUE),
                      gender = sample(c("F", "M"), 17, replace = TRUE))

Second, different options for "grouping".

  • Option 1: Return (in a list) the values from myData1 and myData2 which match a given condition. For this example, you'll end up with a list of two data.frames.

    lapply(list(myData1 = myData1, myData2 = myData2), 
           function(x) x[x$marks >= 30 & x$marks <= 50, ])
    
  • Option 2: Return (in a list) each dataset split into two, one for FALSE (doesn't match the stated condition) and one for TRUE (does match the stated condition). In other words, this creates four groups in total. For this example, you'll end up with a nested list with two list items, each with two data.frames.

    lapply(list(myData1 = myData1, myData2 = myData2), 
           function(x) split(x, x$marks >= 30 & x$marks <= 50))
    
  • Option 3: More flexible than the first. This is essentially @Sacha's example extended to a list. You can set your breaks wherever you would like, making this, in my mind, a really convenient option. For this example, you'll end up with a nested list with two list items, each with multiple data.frames.

    lapply(list(myData1 = myData1, myData2 = myData2),
           function(x) split(x, cut(x$marks, 
                                    breaks = c(0, 30, 50, 75, 100), 
                                    include.lowest = TRUE)))
    
  • Option 4: Combine the data first and use the grouping method described in Option 1. For this example, you will end up with a single data.frame containing only values which match the given condition.

    # Combine the data. Assumes the column names are the same in both sets
    myDataALL <- rbind(myData1, myData2)
    # Extract just the group of scores you're interested in
    myDataALL[myDataALL$marks >= 30 & myDataALL$marks <= 50, ]
    
  • Option 5: Using the combined data, split the data into two groups: one group which matches the stated condition, one which doesn't. For this example, you will end up with a list with two data.frames.

    split(myDataALL, myDataALL$marks >= 30 & myDataALL$marks <= 50)
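
If "put all their information together" means keeping one data.frame with the range recorded for each student, a further possibility (a sketch, not one of the options above) is to store the result of cut() as a new column instead of splitting. The toy data and the column name `range` are assumptions made for illustration.

```r
# Hypothetical combined data, standing in for myDataALL
myDataALL <- data.frame(
  id = c(1, 2, 3, 1, 2),
  name = c("a1", "a2", "a3", "b1", "b2"),
  marks = c(56, 37, 20, 88, 45),
  gender = c("F", "M", "F", "M", "F"))

# Record each student's range in a new column (the name `range` is arbitrary)
myDataALL$range <- cut(myDataALL$marks,
                       breaks = seq(20, 100, by = 10),
                       include.lowest = TRUE)

# Sort so that students in the same range sit next to each other
myDataALL[order(myDataALL$range, myDataALL$marks), ]

# The new column can still be used to count or split later
table(myDataALL$range)
```

Keeping the range as a column leaves the data in a single object, and split(myDataALL, myDataALL$range) still gives the list form whenever it is needed.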
    

I hope one of these options serves your needs!

A5C1D2H2I1M1N2O1R2T1 answered Oct 02 '22 10:10