How to get a numbered list renumbering when a value changes

Tags:

dataframe

r

I have two columns (col1 and col2), shown below. I'd like to add two more columns (col3 and col4). col3 numbers the rows within each run of col2, restarting at 1 every time col2 changes (e.g. from b2 to b3). col4 is TRUE on the last occurrence of each value in col2.

The data is sorted by col1, then by col2, to begin with. Note that values in col2 can occur for different values of col1 (i.e. I can have b1 for every value of col1: a, b, c).

I can get this working fine for ~5,000 rows (~6 sec), but when scaling to ~1 million rows it hangs.

Here is my code:

df$col3 <- 0
df$col4 <- FALSE
stopHere <- nrow(df)
c1 <- 'xxx'
c2 <- 'xxx'
for (i in 1:stopHere) {
  if (df[i, "col1"] != c1) {
    c2 <- 0
    c3 <- 1
    c1 <- df[i, "col1"]
  }
  if (df[i, "col2"] != c2) {
    df[i - 1, "col4"] <- TRUE
    c3 <- 1
    c2  <- df[i, "col2"]
  }
  df[i, "col3"] <- c3
  c3  <- c3 + 1
}

This is my desired output.

1     a   b1    1 FALSE
2     a   b1    2 FALSE
3     a   b1    3  TRUE
4     a   b2    1 FALSE
5     a   b2    2  TRUE
6     a   b3    1 FALSE
7     a   b3    2 FALSE
8     a   b3    3 FALSE
9     a   b3    4 FALSE
10    a   b3    5  TRUE
11    b   b1    1 FALSE
12    b   b1    2 FALSE
13    b   b1    3 FALSE
14    b   b1    4  TRUE
15    b   b2    1 FALSE
16    b   b2    2 FALSE
17    b   b2    3 FALSE
18    b   b2    4  TRUE
19    c   b1    1  TRUE
20    c   b2    1 FALSE
21    c   b2    2 FALSE
22    c   b2    3  TRUE
23    c   b3    1 FALSE
24    c   b3    2  TRUE
25    c   b4    1 FALSE
26    c   b4    2 FALSE
27    c   b4    3 FALSE
28    c   b4    4 FALSE

asked Oct 18 '11 19:10

drbv

1 Answer

Here is a vectorized solution that works for your sample data:

dat <- data.frame(
  V1 = rep(letters[1:3], c(10, 8, 10)),
  V2 = rep(paste("b", c(1:3, 1:2, 1:4), sep = ""), c(3, 2, 5, 4, 4, 1, 3, 2, 4))
)

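Create columns 3 and 4. Note that head(cumsum(zz), -1) leaves out the end of the final run, so the last row stays FALSE, as in your desired output: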

zz <- rle(as.character(dat$V2))$lengths
dat$V3 <- sequence(zz)
dat$V4 <- FALSE
dat$V4[head(cumsum(zz), -1)] <- TRUE
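
To see what each step contributes, here is a minimal sketch on a toy vector. The vector x and the names runs, within_run and run_ends are made up for illustration and are not part of the data above.

```r
# Hypothetical toy vector standing in for dat$V2
x <- c("b1", "b1", "b1", "b2", "b2", "b3")

# Lengths of each run of identical consecutive values
runs <- rle(x)$lengths        # 3 2 1

# Count 1..n within each run; this is what fills column 3
within_run <- sequence(runs)  # 1 2 3 1 2 1

# Row index at which each run ends; these rows get TRUE in column 4
run_ends <- cumsum(runs)      # 3 5 6

# Dropping the last element leaves the final run unmarked
head(run_ends, -1)            # 3 5
```

All three calls operate on whole vectors, which is why this should scale much better than updating the data frame one row at a time inside a for loop.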

The results:

dat
   V1 V2 V3    V4
1   a b1  1 FALSE
2   a b1  2 FALSE
3   a b1  3  TRUE
4   a b2  1 FALSE
5   a b2  2  TRUE
6   a b3  1 FALSE
7   a b3  2 FALSE
8   a b3  3 FALSE
9   a b3  4 FALSE
10  a b3  5  TRUE
11  b b1  1 FALSE
12  b b1  2 FALSE
13  b b1  3 FALSE
14  b b1  4  TRUE
15  b b2  1 FALSE
16  b b2  2 FALSE
17  b b2  3 FALSE
18  b b2  4  TRUE
19  c b1  1  TRUE
20  c b2  1 FALSE
21  c b2  2 FALSE
22  c b2  3  TRUE
23  c b3  1 FALSE
24  c b3  2  TRUE
25  c b4  1 FALSE
26  c b4  2 FALSE
27  c b4  3 FALSE
28  c b4  4 FALSE
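
One caveat: the run lengths above come from V2 alone, so if one V1 group ends with the same V2 value that the next group starts with, the two runs would merge. A possible variant, sketched here on made-up data and assuming the "|" separator never appears in either column, builds the runs from both columns. It also flags the final row as TRUE; wrap the index in head(..., -1) if the last row should stay FALSE as in the desired output.

```r
# Hypothetical data in which "b2" spans the boundary between V1 groups a and b
dat <- data.frame(
  V1 = c("a", "a", "a", "b", "b", "c"),
  V2 = c("b1", "b1", "b2", "b2", "b2", "b1"),
  stringsAsFactors = FALSE
)

# Combine both columns into one key so a change in either starts a new run
# (assumes "|" does not occur in the data)
key  <- paste(dat$V1, dat$V2, sep = "|")
runs <- rle(key)$lengths      # 2 1 2 1

dat$V3 <- sequence(runs)      # 1 2 1 1 2 1
dat$V4 <- FALSE
dat$V4[cumsum(runs)] <- TRUE  # last row of every run, including the final one
```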

answered Nov 13 '22 04:11

Andrie
