Binning data according to a threshold?

Tags: r

Let's say I have a response variable that rises and falls over time. Each time the response rises above a threshold, a new "Trial" begins. That is, if I add a column Threshold that is TRUE whenever Response exceeds a certain value, each consecutive block of rows where Threshold is TRUE constitutes a new trial.

Time <- seq(1, 10, by = 0.5)
Response <- abs(sin(Time))
Threshold <- Response > 0.6
data <- data.frame(Time, Response, Threshold)

Given Time, Response, and Threshold, how could I go about adding a Trial factor that has a new value for each group of TRUE thresholds? Something like this:

   Time   Response Threshold Trial
1   1.0 0.84147098      TRUE     A
2   1.5 0.99749499      TRUE     A
3   2.0 0.90929743      TRUE     A
4   2.5 0.59847214     FALSE    NA
5   3.0 0.14112001     FALSE    NA
6   3.5 0.35078323     FALSE    NA
7   4.0 0.75680250      TRUE     B
8   4.5 0.97753012      TRUE     B
9   5.0 0.95892427      TRUE     B
10  5.5 0.70554033      TRUE     B
11  6.0 0.27941550     FALSE    NA
12  6.5 0.21511999     FALSE    NA
13  7.0 0.65698660      TRUE     C
14  7.5 0.93799998      TRUE     C
15  8.0 0.98935825      TRUE     C
16  8.5 0.79848711      TRUE     C
17  9.0 0.41211849     FALSE    NA
18  9.5 0.07515112     FALSE    NA
19 10.0 0.54402111     FALSE    NA
sudo make install asked Jan 24 '14 04:01




2 Answers

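# cumsum(!Threshold) only increases on FALSE rows, so it is constant within
# each run of TRUE rows; those constants become the factor levels.
data$Trial <- factor(
  ifelse(data$Threshold, cumsum(!data$Threshold), NA), labels = c("A", "B", "C")
)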

##    Time   Response Threshold Trial
## 1   1.0 0.84147098      TRUE     A
## 2   1.5 0.99749499      TRUE     A
## 3   2.0 0.90929743      TRUE     A
## 4   2.5 0.59847214     FALSE  <NA>
## 5   3.0 0.14112001     FALSE  <NA>
## 6   3.5 0.35078323     FALSE  <NA>
## 7   4.0 0.75680250      TRUE     B
## 8   4.5 0.97753012      TRUE     B
## 9   5.0 0.95892427      TRUE     B
## 10  5.5 0.70554033      TRUE     B
## 11  6.0 0.27941550     FALSE  <NA>
## 12  6.5 0.21511999     FALSE  <NA>
## 13  7.0 0.65698660      TRUE     C
## 14  7.5 0.93799998      TRUE     C
## 15  8.0 0.98935825      TRUE     C
## 16  8.5 0.79848711      TRUE     C
## 17  9.0 0.41211849     FALSE  <NA>
## 18  9.5 0.07515112     FALSE  <NA>
## 19 10.0 0.54402111     FALSE  <NA>
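
The labels in the call above are hard-coded for exactly three trials. As a minimal sketch (not part of the original answer), the same cumsum idea can count the trials first and take that many labels from LETTERS. The names run_id and n_trials are illustrative, and the sketch assumes there are at most 26 trials.

```r
# Rebuild the example data from the question.
Time <- seq(1, 10, by = 0.5)
Response <- abs(sin(Time))
Threshold <- Response > 0.6
data <- data.frame(Time, Response, Threshold)

# cumsum(!Threshold) increases on every FALSE row, so it stays constant
# within each run of TRUE rows and differs between runs.
run_id <- ifelse(data$Threshold, cumsum(!data$Threshold), NA)

# Count the distinct runs, then label the sorted run ids A, B, C, ...
# (assumption: at most 26 trials, because LETTERS has 26 entries).
n_trials <- length(unique(na.omit(run_id)))
data$Trial <- factor(run_id, labels = LETTERS[seq_len(n_trials)])

data
```

For the example data, run_id takes the values 0, 3 and 5 on the TRUE rows, so factor() maps them to A, B and C in that order.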
Jake Burkhead answered Sep 29 '22 05:09


Another possibility using rle:

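# Run-length encode Threshold: one entry per run of identical values.
r <- with(data, rle(Threshold))
# Keep only the lengths of the TRUE runs; each one is a trial.
len <- with(r, lengths[values])
n <- length(len)

# One letter per trial, repeated once for each row in that trial.
trial <- rep(x = LETTERS[1:n], times = len)

# Fill in the TRUE rows; the other rows are left as NA.
data$Trial[data$Threshold] <- trial

data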
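
If rle is unfamiliar, it may help to see what it returns for the example data. The sketch below is a variation on the answer, not the answer's own code: it builds one label per run (NA for the FALSE runs) and expands that back to one label per row, so the whole Trial column is created in a single step. The name run_label is illustrative, Trial ends up as a character vector rather than a factor, and at most 26 trials are assumed.

```r
# Rebuild the example data from the question.
Time <- seq(1, 10, by = 0.5)
Response <- abs(sin(Time))
Threshold <- Response > 0.6
data <- data.frame(Time, Response, Threshold)

# rle() collapses the vector into runs of identical values.
r <- rle(data$Threshold)
r$values   # TRUE FALSE TRUE FALSE TRUE FALSE
r$lengths  # 3 3 4 2 4 3

# One label per run: letters for the TRUE runs, NA for the FALSE runs
# (assumption: at most 26 trials, because LETTERS has 26 entries).
run_label <- rep(NA_character_, length(r$values))
run_label[r$values] <- LETTERS[seq_len(sum(r$values))]

# Expand the per-run labels back to one label per row.
data$Trial <- rep(run_label, times = r$lengths)

data
```

Wrapping the last assignment in factor() would give the same column type as the first answer.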
Henrik answered Sep 29 '22 05:09