Define and apply custom bins on a dataframe

Using Python, I have created the following data frame, which contains similarity values:

      cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture    jaccard
    1       0.770     0.489        0.388  0.57500000 0.5845137    0.3920000 0.00000000
    2       0.067     0.496        0.912  0.13865546 0.6147309    0.6984127 0.00000000
    3       0.514     0.426        0.692  0.36440678 0.4787535    0.5198413 0.05882353
    4       0.102     0.430        0.739  0.11297071 0.5288008    0.5436508 0.00000000
    5       0.560     0.735        0.554  0.48148148 0.8168083    0.4603175 0.00000000
    6       0.029     0.302        0.558  0.08547009 0.3928234    0.4603175 0.00000000

I am trying to write an R script that generates another data frame reflecting the bins. Binning applies only to values above 0.5; anything else goes into bin 0:

Pseudocode:

    if      (cosinFcolor > 0.5 & cosinFcolor <= 0.6)    bin = 1
    else if (cosinFcolor > 0.6 & cosinFcolor <= 0.7)    bin = 2
    else if (cosinFcolor > 0.7 & cosinFcolor <= 0.8)    bin = 3
    else if (cosinFcolor > 0.8 & cosinFcolor <= 0.9)    bin = 4
    else if (cosinFcolor > 0.9 & cosinFcolor <= 1.0)    bin = 5
    else                                                bin = 0

Based on the above logic, I want to build a data frame like this (first row shown):

      cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
    1           3         0            0           1         1            0       0

How can I start this as a script, or should I do this in Python? I am trying to get familiar with R after finding out how powerful it is and how many machine learning packages it has. My goal is to build a classifier, but first I need to become familiar with R :)

asked Aug 15 '12 02:08 by add-semi-colons


2 Answers

Another cut answer that takes into account extrema:

    dat <- read.table("clipboard", header=TRUE)

    # Bin every column; the outer intervals (-Inf, 0.5] and (1, Inf) get labels 0 and 6
    cuts <- apply(dat, 2, cut, c(-Inf, seq(0.5, 1, 0.1), Inf), labels=0:6)
    cuts[cuts=="6"] <- "0"   # values above 1 also belong in bin 0
    cuts <- as.data.frame(cuts)

      cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
    1           3         0            0           1         1            0       0
    2           0         0            5           0         2            2       0
    3           1         0            2           0         0            1       0
    4           0         0            3           0         1            1       0
    5           1         3            1           0         4            0       0
    6           0         0            1           0         0            0       0

Explanation

The cut function splits a vector into bins according to the breaks you specify. So let's take 1:10 and split it at 3, 5 and 7.

    cut(1:10, c(3, 5, 7))
     [1] <NA>  <NA>  <NA>  (3,5] (3,5] (5,7] (5,7] <NA>  <NA>  <NA>
    Levels: (3,5] (5,7]

You can see how it has made a factor whose levels are the intervals between the breaks. Also notice that it doesn't include 3 (there's an include.lowest argument which will include it). But these are terrible names for groups, so let's call them group 1 and 2.

    cut(1:10, c(3, 5, 7), labels=1:2)
     [1] <NA> <NA> <NA> 1    1    2    2    <NA> <NA> <NA>

Better, but what's with the NAs? They are outside our boundaries and not counted. To count them, in my solution, I added -infinity and infinity, so all points would be included. Notice that as we have more breaks, we'll need more labels:

    x <- cut(1:10, c(-Inf, 3, 5, 7, Inf), labels=1:4)
    x
     [1] 1 1 1 2 2 3 3 4 4 4
    Levels: 1 2 3 4

Ok, but we didn't want 4 (as per your problem). We wanted all the 4s to be in group 1. So let's get rid of the entries which are labelled '4'.

    x[x=="4"] <- "1"
    x
     [1] 1 1 1 2 2 3 3 1 1 1
    Levels: 1 2 3 4

This is slightly different from what I did before (notice that there I removed the last label at the end), but I've done it this way here so you can better see how cut works.
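Since the question also asks about Python, here is a rough standard-library sketch of the same walkthrough. The helper name `cut_labels` is made up for illustration and is not part of any library; it only mimics the right-closed `(a, b]` behaviour of cut shown above, with the outer intervals always included.

```python
from bisect import bisect_left

def cut_labels(values, breaks, labels):
    """Hypothetical helper: label each value by the interval (a, b] it falls in.

    `breaks` must be sorted. The intervals are (-inf, b0], (b0, b1], ...,
    (b_last, inf), so `labels` needs len(breaks) + 1 entries.
    """
    if len(labels) != len(breaks) + 1:
        raise ValueError("need exactly one more label than breaks")
    # bisect_left counts the breaks strictly below v, which is the index
    # of the right-closed interval that contains v.
    return [labels[bisect_left(breaks, v)] for v in values]

x = cut_labels(range(1, 11), [3, 5, 7], [1, 2, 3, 4])
# Fold the overflow group 4 back into group 1, as in the R example.
x = [1 if g == 4 else g for g in x]
print(x)  # [1, 1, 1, 2, 2, 3, 3, 1, 1, 1]
```

Unlike the R factor, the result here is a plain list, so there are no leftover levels to worry about.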

Ok, the apply function. So far, we've been using cut on a single vector. But you want it used on a collection of vectors: each column of your data frame. That's what apply is for. Its second argument picks the direction: 1 applies the function to each row, 2 applies it to each column. So we apply the cut function to each column of your data frame. Everything after cut in the apply call is just an argument passed on to cut, as discussed above.
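If you do end up trying this in Python, the column-wise step might look roughly like the following. This is only a sketch using the standard library: the data frame is represented as a plain dict of columns (an assumption for the example), and `bin_value` is a hypothetical helper that encodes the rule from the question.

```python
from bisect import bisect_left

BREAKS = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

def bin_value(v):
    """Hypothetical helper: bins 1..5 for (0.5, 0.6] ... (0.9, 1.0], else 0."""
    k = bisect_left(BREAKS, v)           # number of breaks strictly below v
    return k if k < len(BREAKS) else 0   # k == 6 means v > 1.0, so bin 0

# The similarity values from the question, one list per column.
dat = {
    "cosinFcolor":  [0.770, 0.067, 0.514, 0.102, 0.560, 0.029],
    "cosinEdge":    [0.489, 0.496, 0.426, 0.430, 0.735, 0.302],
    "cosinTexture": [0.388, 0.912, 0.692, 0.739, 0.554, 0.558],
    "histoFcolor":  [0.57500000, 0.13865546, 0.36440678,
                     0.11297071, 0.48148148, 0.08547009],
    "histoEdge":    [0.5845137, 0.6147309, 0.4787535,
                     0.5288008, 0.8168083, 0.3928234],
    "histoTexture": [0.3920000, 0.6984127, 0.5198413,
                     0.5436508, 0.4603175, 0.4603175],
    "jaccard":      [0.0, 0.0, 0.05882353, 0.0, 0.0, 0.0],
}

# Rough analogue of apply(dat, 2, ...): run the binning over every column.
bins = {name: [bin_value(v) for v in column] for name, column in dat.items()}

for name, column in bins.items():
    print(f"{name:12s} {column}")
```

The dict comprehension plays the role of the margin argument 2: it visits each column in turn and leaves the row order untouched.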

Hope that helps.

answered Sep 30 '22 14:09 by sebastian-c


You can also use findInterval:

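    findInterval(seq(0, 1, l=20), seq(0.5, 1, by=0.1))
    ##  [1] 0 0 0 0 0 0 0 0 0 0 1 1 2 2 3 3 4 4 5 6

Note that findInterval uses left-closed intervals [a, b) by default, so a value of exactly 1 lands in interval 6. Passing left.open=TRUE (available in newer versions of R) gives the (a, b] intervals from the question; values above 1 then still return 6 and need mapping back to 0.

answered Sep 30 '22 16:09 by mnel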
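For comparison on the Python side, the standard-library `bisect` module gives a similar lookup. This is a sketch rather than a drop-in replacement: the two function names are invented for the example, and the point is only to show how the choice of `bisect_right` versus `bisect_left` decides which end of each interval is closed.

```python
from bisect import bisect_left, bisect_right

BREAKS = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

def find_interval(x, breaks):
    """Count of breaks <= x: left-closed intervals [a, b)."""
    return bisect_right(breaks, x)

def find_interval_left_open(x, breaks):
    """Count of breaks < x: right-closed intervals (a, b]."""
    return bisect_left(breaks, x)

# Compare the two conventions on values that sit on or near a break.
table = [
    (x, find_interval(x, BREAKS), find_interval_left_open(x, BREAKS))
    for x in (0.49, 0.5, 0.55, 0.6, 1.0)
]
for x, closed_left, closed_right in table:
    print(x, closed_left, closed_right)
```

The two columns differ only when a value sits exactly on a break, which is where the pseudocode in the question (strict `>` on the lower bound, `<=` on the upper) matches the right-closed version.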