Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to prevent rows with duplicated indices / keys to be appended to a data.frame?

I have data in which the combination of two variables ("ManufactererId" and "ProductId") constitute unique keys / identifiers. The data looks like this:

my.data <- data.frame(ManufactererId = c(1, 1, 2, 2),
                      ProductId = c(1, 2, 1, 7),
                      Price = c(12.99, 149.00, 0.99, 3.99))
my.data
#   ManufactererId ProductId  Price
# 1              1         1  12.99
# 2              1         2 149.00
# 3              2         1   0.99
# 4              2         7   3.99

I want to ensure that I cannot accidentally add another row with a pair of ManufactererId - ProductId equal to what is already present in the table (like the unique constraint on a database table).

That is, if I try to add a row with ManufactererId = 2 and ProductId = 7 to my data frame:

my.data <- rbind(my.data, data.frame(ManufactererId = 2, ProductId = 7, Price = 120.00))

...it should fail with an error. How can this be achieved?

Or should I use a different data type?

like image 961
kmccoy Avatar asked Mar 20 '16 10:03

kmccoy


People also ask

How do I remove duplicate rows in a data frame?

By using pandas. DataFrame. drop_duplicates() method you can remove duplicate rows from DataFrame. Using this method you can drop duplicate rows on selected multiple columns or all columns.

How do you eliminate duplicate rows in Python?

Pandas drop_duplicates() Function Syntax keep: allowed values are {'first', 'last', False}, default 'first'. If 'first', duplicate rows except the first one is deleted. If 'last', duplicate rows except the last one is deleted. If False, all the duplicate rows are deleted.

How do you remove duplicates from a DataFrame column?

To drop duplicate columns from pandas DataFrame use df. T. drop_duplicates(). T , this removes all columns that have the same data regardless of column names.


2 Answers

1) zoo Whether this is convenient or not depends on what operations you want to do but zoo objects have unique indexes. We can construct a text index by pasting together the two Id columns.

library(zoo)
z <- with(my.data, zoo(Price, paste(ManufactererId, ProductId)))

z <- c(z, zoo(90, "1 1")) # Error, not appended
z <- c(z, zoo(90, "1 3")) # OK

Note that the data part of a zoo object can be a vector as shown above or a matrix in case you have more than just Price in the data.

2) SQLite This could be done with any of a number of databases but we will use SQLite here. First we create a table with a unique index in an SQLite database and then we insert rows.

library(RSQLite)

con <- dbConnect(SQLite())
dbWriteTable(con, "my", my.data, row.names = FALSE)
dbGetQuery(con, "create unique index ix on my(ManufactererId, ProductId)")

dbGetQuery(con, sprintf("insert into my values(%d, %d, %d)", 1, 1, 99)) # error
dbGetQuery(con, sprintf("insert into my values(%d, %d, %d)", 1, 13, 90)) # OK
like image 100
G. Grothendieck Avatar answered Sep 22 '22 10:09

G. Grothendieck


You can do something like this: Where keys is your unique-key

append_save <- function(DF, to_be_appended, keys=c("ManufactererId", "ProductId")){
  if(ncol(DF) != ncol(to_be_appended) || !all(names(DF) %in% names(to_be_appended))){
    stop("must have the same columns")
  }
  if(nrow(merge(DF, to_be_appended, by=keys))==0){
    rbind(DF, to_be_appended)
  } else {
    stop("Trying to append douplicated indices")
  }
}

Test it:

to_be_appended = data.frame(ManufactererId=2,ProductId=17,Price=3.99)
append_save(my.data, to_be_appended) # works
to_be_appended_err = data.frame(ManufactererId=2,ProductId=7,Price=3.99)
append_save(my.data, to_be_appended_err) # error

If you append data only based on key-columns you could use data.table as follows:

append_save <- function(DF, to_be_appended, keys=c("ManufactererId", "ProductId")){
  if(!all(keys %in% names(to_be_appended))){
    stop("key-columns must be present")
  }
  if(nrow(data.table::merge(DF, to_be_appended, on=keys))==0){
    data.table::setDF(data.table::rbindlist(list(DF, to_be_appended), fill = TRUE))[]
  } else {
    stop("Trying to append douplicated indices")
  }
}
like image 26
Rentrop Avatar answered Sep 23 '22 10:09

Rentrop