Define and apply custom bins on a dataframe

Tags:

summarize

Using Python, I have created the following data frame, which contains similarity values:

  cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture    jaccard
1       0.770     0.489        0.388  0.57500000 0.5845137    0.3920000 0.00000000
2       0.067     0.496        0.912  0.13865546 0.6147309    0.6984127 0.00000000
3       0.514     0.426        0.692  0.36440678 0.4787535    0.5198413 0.05882353
4       0.102     0.430        0.739  0.11297071 0.5288008    0.5436508 0.00000000
5       0.560     0.735        0.554  0.48148148 0.8168083    0.4603175 0.00000000
6       0.029     0.302        0.558  0.08547009 0.3928234    0.4603175 0.00000000

I am trying to write an R script that generates another data frame holding the bin of each value. Binning only applies when a value is above 0.5; anything else goes into bin 0.

Pseudocode:

if      (cosinFcolor > 0.5 & cosinFcolor <= 0.6)    bin = 1
else if (cosinFcolor > 0.6 & cosinFcolor <= 0.7)    bin = 2
else if (cosinFcolor > 0.7 & cosinFcolor <= 0.8)    bin = 3
else if (cosinFcolor > 0.8 & cosinFcolor <= 0.9)    bin = 4
else if (cosinFcolor > 0.9 & cosinFcolor <= 1.0)    bin = 5
else                                                bin = 0

Based on the above logic, I want to build a data frame like this (first row shown):

  cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
1           3         0            0           1         1            0       0

How can I start this as a script, or should I do this in Python? I am trying to get familiar with R after finding out how powerful it is and how many machine learning packages it has. My goal is to build a classifier, but first I need to become familiar with R :)

340

asked Aug 15 '12 02:08

add-semi-colons

2 Answers

Another cut answer that takes into account extrema:

dat <- read.table("clipboard", header=TRUE)

cuts <- apply(dat, 2, cut, c(-Inf,seq(0.5, 1, 0.1), Inf), labels=0:6)
cuts[cuts=="6"] <- "0"
cuts <- as.data.frame(cuts)

  cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
1           3         0            0           1         1            0       0
2           0         0            5           0         2            2       0
3           1         0            2           0         0            1       0
4           0         0            3           0         1            1       0
5           1         3            1           0         4            0       0
6           0         0            1           0         0            0       0
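For readers who cannot paste from the clipboard, here is a minimal self-contained sketch of the same idea. It rebuilds the question's data frame inline and returns integer bins rather than character ones. The helper name `bin_column` is hypothetical, and using `lapply` with `labels = FALSE` is a swapped-in variation, not the code from the answer above.

```r
# Rebuild the question's data frame inline (values copied from the question).
dat <- data.frame(
  cosinFcolor  = c(0.770, 0.067, 0.514, 0.102, 0.560, 0.029),
  cosinEdge    = c(0.489, 0.496, 0.426, 0.430, 0.735, 0.302),
  cosinTexture = c(0.388, 0.912, 0.692, 0.739, 0.554, 0.558),
  histoFcolor  = c(0.57500000, 0.13865546, 0.36440678,
                   0.11297071, 0.48148148, 0.08547009),
  histoEdge    = c(0.5845137, 0.6147309, 0.4787535,
                   0.5288008, 0.8168083, 0.3928234),
  histoTexture = c(0.3920000, 0.6984127, 0.5198413,
                   0.5436508, 0.4603175, 0.4603175),
  jaccard      = c(0, 0, 0.05882353, 0, 0, 0)
)

# bin_column is a hypothetical helper name, not part of base R.
bin_column <- function(x) {
  breaks <- c(-Inf, seq(0.5, 1, by = 0.1), Inf)
  # labels = FALSE makes cut() return integer codes 1..7 instead of a factor;
  # subtracting 1 shifts them to 0..6 so that "at or below 0.5" becomes bin 0.
  codes <- cut(x, breaks = breaks, labels = FALSE) - 1L
  # Code 6 means "above 1.0", which the question also treats as bin 0.
  codes[codes == 6L] <- 0L
  codes
}

# lapply() visits each column; as.data.frame() keeps the column names.
bins <- as.data.frame(lapply(dat, bin_column))
bins
```

Because each column stays an integer vector, the result can be used directly in arithmetic or as model input without converting from character first.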

Explanation

The cut function splits a vector into bins at the breaks you specify. So let's take 1:10 and split it at 3, 5 and 7.

cut(1:10, c(3, 5, 7))
 [1] <NA>  <NA>  <NA>  (3,5] (3,5] (5,7] (5,7] <NA>  <NA>  <NA>
Levels: (3,5] (5,7]

You can see how it has made a factor whose levels are the intervals between the breaks. Also notice that it doesn't include 3 (there's an include.lowest argument which will include it). But these are terrible names for groups, so let's call them group 1 and 2.

cut(1:10, c(3, 5, 7), labels=1:2)
 [1] <NA> <NA> <NA> 1    1    2    2    <NA> <NA> <NA>
Levels: 1 2

Better, but what's with the NAs? They are outside our boundaries and not counted. To count them, in my solution, I added -infinity and infinity, so all points would be included. Notice that as we have more breaks, we'll need more labels:

x <- cut(1:10, c(-Inf, 3, 5, 7, Inf), labels=1:4)
x
 [1] 1 1 1 2 2 3 3 4 4 4
Levels: 1 2 3 4

Ok, but we didn't want 4 (as per your problem). We wanted all the 4s to be in group 1. So let's get rid of the entries which are labelled '4'.

x[x=="4"] <- "1"
x
 [1] 1 1 1 2 2 3 3 1 1 1
Levels: 1 2 3 4

This is slightly different from what I did in the solution, where I replaced the last label only at the end, after cutting every column, but I've done it this way here so you can better see how cut works.
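On the boundary behaviour mentioned earlier (3 being left out of the first bin), the following short sketch shows how the include.lowest and right arguments of cut change which end of each interval is closed. The variable names are only illustrative.

```r
x <- 1:10

# Default: intervals are open on the left, (3,5] and (5,7], so 3 is NA.
default_bins <- cut(x, c(3, 5, 7))

# include.lowest = TRUE closes the first interval on the left: [3,5] and (5,7].
with_lowest <- cut(x, c(3, 5, 7), include.lowest = TRUE)

# right = FALSE flips the closed side: [3,5) and [5,7), so now 7 is NA.
left_closed <- cut(x, c(3, 5, 7), right = FALSE)

# Compare the three results side by side as integer bin codes.
data.frame(x,
           default     = as.integer(default_bins),
           with_lowest = as.integer(with_lowest),
           left_closed = as.integer(left_closed))
```

For the bins in the question, which are open on the left and closed on the right, the default right = TRUE is already the matching choice.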

Ok, the apply function. So far, we've been using cut on a single vector. But you want it used on a collection of vectors: each column of your data frame. That's what the second argument of apply controls: 1 applies the function to each row, 2 applies it to each column. So we apply the cut function to each column of your data frame. Everything after cut in the call to apply is just an argument passed on to cut, as discussed above.
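To make the margin argument concrete, here is a small sketch on a toy matrix (the matrix and variable names are made up for illustration). It also shows extra arguments being forwarded to cut, and that apply hands back a character matrix when the function returns a factor, which is why the solution compares against "6" as a string.

```r
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns: rows are (1,3,5) and (2,4,6)

row_sums <- apply(m, 1, sum)   # margin 1: one result per row
col_sums <- apply(m, 2, sum)   # margin 2: one result per column

# Arguments after the function name are forwarded to that function,
# here the breaks and labels for cut().
binned <- apply(m, 2, cut, breaks = c(-Inf, 2, 4, Inf), labels = 1:3)

row_sums
col_sums
binned   # a character matrix, because factor results are coerced by apply()
```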

Hope that helps.

160

answered Sep 30 '22 14:09

sebastian-c

You can also use findInterval:

findInterval(seq(0, 1, length.out=20), seq(0.5, 1, by=0.1))
## [1] 0 0 0 0 0 0 0 0 0 0 1 1 2 2 3 3 4 4 5 6
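By default findInterval treats intervals as closed on the left, so a value equal to a break lands in the higher bin, and anything at or above the last break gets the total number of breaks. A minimal sketch of one way to match the question's bins, which are open on the left and closed on the right, is below. It assumes R 3.3.0 or later for the left.open argument, and the sample values in `x` are made up for illustration.

```r
breaks <- seq(0.5, 1, by = 0.1)            # 0.5, 0.6, ..., 1.0
x <- c(0.770, 0.489, 0.5, 0.912, 1.0, 1.2) # illustrative values only

# left.open = TRUE makes each interval (lower, upper], as in the question.
bins <- findInterval(x, breaks, left.open = TRUE)

# Values above the last break get length(breaks); the question wants bin 0.
bins[bins == length(breaks)] <- 0L

bins
```

Applied column by column (for example with lapply over the data frame), this gives integer bins directly, without the factor-to-character conversion that cut inside apply involves.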

answered Sep 30 '22 16:09

mnel


Tags:

dataframe

r

binning