 

Number equal rows in data.frame

Tags: dataframe, r, plyr

I have a data frame that looks like this:

df <- data.frame(
  Logical = c(TRUE,FALSE,FALSE,FALSE,FALSE,FALSE),
  A = c(1,2,3,2,3,1),
  B = c(1,0.05,0.80,0.05,0.80,1),
  C = c(1,10.80,15,10.80,15,1))

Which looks like:

  Logical A    B    C
1    TRUE 1 1.00  1.0
2   FALSE 2 0.05 10.8
3   FALSE 3 0.80 15.0
4   FALSE 2 0.05 10.8
5   FALSE 3 0.80 15.0
6   FALSE 1 1.00  1.0

I want to add a new variable, D, an integer assigned as follows: 0 if df$Logical is TRUE; otherwise an integer, starting at 1, that is shared by all rows whose A, B and C values are approximately equal (approximately because they are doubles, so the comparison must allow a floating point margin of error).

The expected output:

  Logical A    B    C D
1    TRUE 1 1.00  1.0 0
2   FALSE 2 0.05 10.8 1
3   FALSE 3 0.80 15.0 2
4   FALSE 2 0.05 10.8 1
5   FALSE 3 0.80 15.0 2
6   FALSE 1 1.00  1.0 3

The first row gets 0 because Logical is TRUE; the second and fourth rows get 1 because A, B and C are approximately equal there, and likewise the third and fifth rows get 2. Row six gets 3 because it is the next unique row. Note that the order of the integers assigned in D is irrelevant except for the 0; e.g., rows 2 and 4 could also be assigned 2, as long as that integer is not reused for any other case of D.
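For reference, this rule can also be implemented directly in base R by matching rows against a key built from rounded values. A sketch; rounding to 10 significant digits is an assumed tolerance, not something given in the question:

```r
df <- data.frame(
  Logical = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
  A = c(1, 2, 3, 2, 3, 1),
  B = c(1, 0.05, 0.80, 0.05, 0.80, 1),
  C = c(1, 10.80, 15, 10.80, 15, 1))

## Build a string key from rounded values so that floating point noise
## below the rounding threshold cannot split a group.
key <- paste(signif(df$A, 10), signif(df$B, 10), signif(df$C, 10))

## Number the distinct keys among the non-TRUE rows in order of
## appearance, then zero out the TRUE rows.
df$D <- match(key, unique(key[!df$Logical]))
df$D[df$Logical] <- 0L

df$D
# [1] 0 1 2 1 2 3
```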


I have considered using aggregating functions. For example using ddply:

library("plyr")
df$foo <- 1:nrow(df)                       # row index, to recover positions later
foo <- dlply(df, .(A, B, C), '[[', "foo")  # list of row indices per unique (A, B, C)
df$D <- 0
for (i in 1:length(foo)) df$D[foo[[i]]] <- i
df$D[df$Logical] <- 0

This works, but I am not sure how well it copes with floating point errors (I could round the values before the call, which should make it fairly stable). With a loop it is straightforward:

df$D <- 0
cnt <- 1  # next group number ("cnt" rather than "c", which masks base::c)
for (i in 1:nrow(df)) {
  if (!isTRUE(df$Logical[i]) && df$D[i] == 0) {
    # find all non-TRUE rows whose A, B, C are approximately equal to row i
    par <- sapply(1:nrow(df), function(j) {
      !df$Logical[j] &&
        isTRUE(all.equal(unlist(df[j, c("A", "B", "C")]),
                         unlist(df[i, c("A", "B", "C")])))
    })
    df$D[par] <- cnt
    cnt <- cnt + 1
  }
}

but this is very slow for larger data frames.
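The quadratic loop can be avoided by combining the rounding idea above with plyr::id(), which numbers the unique combinations in one pass. A sketch; the 10 significant digits are again an assumed tolerance:

```r
library(plyr)

df <- data.frame(
  Logical = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
  A = c(1, 2, 3, 2, 3, 1),
  B = c(1, 0.05, 0.80, 0.05, 0.80, 1),
  C = c(1, 10.80, 15, 10.80, 15, 1))

## Round the grouping columns, then let id() assign one integer per
## unique (A, B, C) combination; finally zero out the TRUE rows.
rounded <- as.data.frame(lapply(df[c("A", "B", "C")], signif, digits = 10))
df$D <- as.integer(id(rounded))
df$D[df$Logical] <- 0L
```

The particular integers may differ from the expected output above (id() numbers groups by level order, not appearance order), but only their uniqueness matters here.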

asked Mar 03 '26 by Sacha Epskamp

1 Answer

As per Matthew Dowle's comments below, data.table can group numeric values, treating two values as equal when they are within a tolerance of .Machine$double.eps^0.5. With that in mind, a data.table solution should work:

library(data.table)

DT <- as.data.table(df)
DT[, D := 0]

## Manual group counter: variables assigned in j are retained between
## groups, so .GRP here increments once per unique (A, B, C) combination.
.GRP <- 0
DT[!Logical, D := .GRP <- .GRP + 1, by = "A,B,C"]

DT
#    Logical A    B    C foo D
# 1:    TRUE 1 1.00  1.0   1 0
# 2:   FALSE 2 0.05 10.8   2 1
# 3:   FALSE 3 0.80 15.0   3 2
# 4:   FALSE 2 0.05 10.8   4 1
# 5:   FALSE 3 0.80 15.0   5 2
# 6:   FALSE 1 1.00  1.0   6 3

(The foo column is left over from the plyr attempt in the question.)

As Matthew Dowle writes, .GRP is implemented in data.table 1.8.3, but I'm still on 1.8.2, hence the manual counter above.
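For anyone on data.table 1.8.3 or later, the built-in .GRP symbol makes a manual counter unnecessary. A sketch, assuming that version or newer:

```r
library(data.table)

DT <- data.table(
  Logical = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
  A = c(1, 2, 3, 2, 3, 1),
  B = c(1, 0.05, 0.80, 0.05, 0.80, 1),
  C = c(1, 10.80, 15, 10.80, 15, 1))

DT[, D := 0L]
## .GRP is the current group's number (1, 2, ...), counted over the
## groups formed within the !Logical subset, in order of appearance.
DT[!Logical, D := .GRP, by = list(A, B, C)]

DT$D
# [1] 0 1 2 1 2 3
```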


Follow-up from the comments; here's the NEWS item from 1.8.2 (Matthew Dowle: "Will add to ?data.table, thanks for highlighting!"):

Numeric columns (type double) are now allowed in keys and ad hoc by. J() and SJ() no longer coerce double to integer. i join columns which mismatch on numeric type are coerced silently to match the type of x's join column. Two floating point values are considered equal (by grouping and binary search joins) if their difference is within sqrt(.Machine$double.eps), by default. See example in ?unique.data.table. Completes FRs #951, #1609 and #1075. This paves the way for other atomic types which use double (such as POSIXct and bit64). Thanks to Chris Neff for beta testing and finding problems with keys of two numeric columns (bug #2004), fixed and tests added.
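Note that in current versions of data.table, this tolerance is no longer applied by default; it is controlled explicitly with setNumericRounding(), which rounds off trailing mantissa bytes when sorting, grouping and joining. A small sketch of the idea, assuming a modern data.table:

```r
library(data.table)

## Round off the last 2 bytes of the double mantissa when grouping, so
## values that differ only by tiny floating point noise collapse together.
setNumericRounding(2)

DTtol <- data.table(x = c(1, 1 + .Machine$double.eps, 2))
DTtol[, .N, by = x]   # two groups: {1, 1 + eps} and {2}
```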

answered Mar 05 '26 by BenBarnes


