 

How to delete a row by reference in data.table?

Tags:

r

data.table



Good question. data.table can't delete rows by reference yet.

data.table can add and delete columns by reference since it over-allocates the vector of column pointers, as you know. The plan is to do something similar for rows and allow fast insert and delete. A row delete would use memmove in C to budge up the items (in each and every column) after the deleted rows. Deleting a row in the middle of the table would still be quite inefficient compared to a row store database such as SQL, which is more suited for fast insert and delete of rows wherever those rows are in the table. But still, it would be a lot faster than copying a new large object without the deleted rows.

On the other hand, since column vectors would be over-allocated, rows could be inserted (and deleted) at the end, instantly; e.g., a growing time series.
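The asymmetry described above can be observed from R. This is only a small sketch, assuming a reasonably recent data.table where truelength() and address() are exported; the table and variable names are made up for illustration.

```r
library(data.table)

DT <- data.table(a = 1:5, b = letters[1:5])

# The vector of column pointers is over-allocated: truelength() is larger
# than the number of columns currently in use.
spare_slots <- truelength(DT) - length(DT)

addr_before <- address(DT)
DT[, c := a * 2L]          # add a column by reference
DT[, b := NULL]            # delete a column by reference
addr_after <- address(DT)  # still the same object, the table was not copied

# There is no by-reference equivalent for rows: dropping row 3 builds a new table.
DT2 <- DT[-3]
addr_rows <- address(DT2)
```

Comparing addr_before with addr_after and addr_rows shows which operations kept the original object and which produced a copy.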


It's filed as an issue: Delete rows by reference.


The approach that I have taken to keep memory use close to that of in-place deletion is to subset one column at a time and delete the original column as I go. It is not as fast as a proper C memmove solution, but memory use is all I care about here. Something like this:

library(data.table)

DT = data.table(col1 = 1:1e6)
cols = paste0('col', 2:100)
for (col in cols){ DT[, (col) := 1:1e6] }   # build a 100-column table
keep.idxs = sort(sample(1e6, 9e5, FALSE))   # keep 90% of entries, in their original order
DT.subset = data.table(col1 = DT[['col1']][keep.idxs]) # this is the subsetted table
for (col in cols){
  DT.subset[, (col) := DT[[col]][keep.idxs]] # copy one subsetted column across
  DT[, (col) := NULL]                        # then free the original column
}
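To check that the column-at-a-time approach gives the same rows as an ordinary subset, here is a scaled-down sketch. The table size, the column contents and the name `expected` are assumptions chosen for illustration, and the comparison itself makes a full copy, so it only makes sense on a small table.

```r
library(data.table)

set.seed(1)
n <- 1000L
DT <- data.table(col1 = seq_len(n), col2 = rnorm(n), col3 = sample(letters, n, TRUE))

keep.idxs <- sort(sample(n, 900L))  # rows to keep, in original order
expected <- DT[keep.idxs]           # ordinary subset, copies everything at once

cols <- setdiff(names(DT), "col1")
DT.subset <- data.table(col1 = DT[["col1"]][keep.idxs])
for (col in cols) {
  DT.subset[, (col) := DT[[col]][keep.idxs]]  # move one column across
  DT[, (col) := NULL]                         # drop it from the original
}

same <- isTRUE(all.equal(expected, DT.subset))
```

Note that after the loop the original DT is left with only col1, so it should be discarded or overwritten with DT.subset.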

Here is a working function based on @vc273's answer and @Frank's feedback.

delete <- function(DT, del.idxs) {           # pls note 'del.idxs' vs. 'keep.idxs'
  keep.idxs <- setdiff(DT[, .I], del.idxs)   # select row indexes to keep
  cols <- names(DT)
  DT.subset <- data.table(DT[[1]][keep.idxs]) # this is the subsetted table
  setnames(DT.subset, cols[1])
  for (col in cols[-1]) {                    # cols[-1] is empty for a one-column table
    DT.subset[, (col) := DT[[col]][keep.idxs]]
    DT[, (col) := NULL]                      # delete from the original to free memory
  }
  return(DT.subset)
}

An example of its usage:

dat <- delete(dat,del.idxs)   ## Pls note 'del.idxs' instead of 'keep.idxs'

Where "dat" is a data.table. Removing 14k rows from 1.4M rows takes 0.25 sec on my laptop.

> dim(dat)
[1] 1419393      25
> system.time(dat <- delete(dat,del.idxs))
   user  system elapsed 
   0.23    0.02    0.25 
> dim(dat)
[1] 1404715      25
> 

PS. Since I am new to SO, I could not add comment to @vc273's thread :-(


Instead of trying to set to NULL, try setting to NA (matching the NA type of the first column):

set(DT,1:2, 1:3 ,NA_character_)
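A self-contained sketch of that idea, assuming a toy table whose columns are all character. It loops over the columns one at a time as a conservative variant of the call above, then filters the blanked rows out. Note that the final filter still creates a new table; only the NA assignment happens by reference.

```r
library(data.table)

DT <- data.table(a = c("x", "y", "z"),
                 b = c("p", "q", "r"),
                 c = c("u", "v", "w"))

addr_before <- address(DT)
for (j in seq_along(DT)) {
  # blank out rows 1 and 2 of column j in place
  set(DT, i = 1:2, j = j, value = NA_character_)
}
addr_after <- address(DT)   # same object: set() did not copy the table

# The rows are still there (as NA) until they are filtered out, which copies.
DT_clean <- DT[!is.na(a)]
```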

The topic is still of interest to many people (me included).

What about this? I used assign to rebind the variable in globalenv, together with the code described previously. It would be better to capture the original environment, but at least in globalenv it is memory efficient and acts like a change by reference.

delete <- function(DT, del.idxs)
{
  varname <- deparse(substitute(DT))   # name of the variable passed in by the caller

  keep.idxs <- setdiff(DT[, .I], del.idxs)
  cols <- names(DT)
  DT.subset <- data.table(DT[[1]][keep.idxs])
  setnames(DT.subset, cols[1])

  for (col in cols[-1])                # cols[-1] is empty for a one-column table
  {
    DT.subset[, (col) := DT[[col]][keep.idxs]]
    DT[, (col) := NULL]                # delete
  }

  assign(varname, DT.subset, envir = globalenv())   # rebind the name in globalenv
  return(invisible())
}

DT = data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
delete(DT, 3)

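As a possible way to "capture the original environment" mentioned above, here is a hedged sketch that rebinds the name in the caller's frame via parent.frame() instead of globalenv(). The function name delete_in_caller and the wrapper f are hypothetical, and the sketch assumes the first argument is passed as a plain variable name.

```r
library(data.table)

# Hypothetical variant: rebinds the variable where the function was called from.
delete_in_caller <- function(DT, del.idxs) {
  stopifnot(is.name(substitute(DT)), is.data.table(DT))
  varname <- deparse(substitute(DT))

  keep.idxs <- setdiff(seq_len(nrow(DT)), del.idxs)
  cols <- names(DT)
  DT.subset <- data.table(DT[[1]][keep.idxs])
  setnames(DT.subset, cols[1])
  for (col in cols[-1]) {
    DT.subset[, (col) := DT[[col]][keep.idxs]]
    DT[, (col) := NULL]
  }

  assign(varname, DT.subset, envir = parent.frame())  # caller's frame, not globalenv()
  invisible(NULL)
}

# Usage from inside another function, where globalenv() would be the wrong target.
f <- function() {
  local_dt <- data.table(x = letters[1:4], v = 1:4)
  delete_in_caller(local_dt, 2L)
  local_dt
}
res <- f()
```

This still only looks like a change by reference: the name is rebound to a new table, so any other variable that pointed at the old table will see it stripped down to its first column.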
Here are some strategies I have used. I believe a .ROW function may be coming. None of the approaches below are fast; they go a little beyond plain subsetting or filtering. I tried to think like a DBA who is just trying to clean up data. As noted above, you can select or remove rows in data.table:

library(data.table)
data(iris)
iris <- data.table(iris)

iris[3] # Select row three

iris[-3] # Remove row three

You can also use .SD to select or remove rows:

iris[,.SD[3]] # Select row three

iris[,.SD[3:6],by=.(Species)] # Select rows 3 - 6 for each Species

iris[,.SD[-3]] # Remove row three

iris[,.SD[-3:-6],by=.(Species)] # Remove rows 3 - 6 for each Species

Note: .SD creates a subset of the original data and allows you to do quite a bit of work in j or in a subsequent data.table call. See https://stackoverflow.com/a/47406952/305675. Here I order my irises by Sepal.Length (descending), keep only rows above a specified minimum Sepal.Length, select the top three (by Sepal.Length) for each Species, and return all accompanying data:

iris[order(-Sepal.Length)][Sepal.Length > 3,.SD[1:3],by=.(Species)]
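A related sketch, assuming the built-in iris data: instead of subsetting .SD inside each group, the global row numbers to drop can be collected with .I and removed in a single step. Both versions return a new table rather than deleting by reference; the names via_sd, via_i and drop_rows are just for illustration.

```r
library(data.table)

dt <- as.data.table(datasets::iris)

# .SD version from above: drop rows 3 to 6 within each Species
via_sd <- dt[, .SD[-(3:6)], by = Species]

# .I version: collect the global row numbers to drop, then remove them once
drop_rows <- dt[, .I[3:6], by = Species]$V1
via_i <- dt[-drop_rows]
```

The .I version keeps the original column order, whereas the grouped .SD version moves Species to the front.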

The approaches above all renumber the remaining rows sequentially when rows are removed. You can also transpose a data.table and remove or replace the old rows, which are now transposed columns. When using ':=NULL' to remove a transposed row, its column name (V3 here) is removed as well:

m_iris <- data.table(t(iris))[,V3:=NULL] # V3 column removed

d_iris <- data.table(t(iris))[,V3:=V2] # V3 column replaced with V2

When you transpose back to the original orientation, you may want to restore the column names from the original data.table and, in the case of deletion, the column classes. Transposing coerces every value to character, so the round-tripped table contains only character columns.

m_iris <- data.table(t(m_iris))   # transpose back: row three deleted
setnames(m_iris,names(iris))

d_iris <- data.table(t(d_iris))   # transpose back: row three replaced by row two
setnames(d_iris,names(iris))

You may just want to remove duplicate rows which you can do with or without a Key:

d_iris[,Key:=paste0(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species)]     

d_iris[!duplicated(Key),]

d_iris[!duplicated(paste0(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species)),]  
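data.table also ships its own unique() and duplicated() methods with a by argument, which avoid building a pasted key. A minimal sketch on a made-up five-row table (the column names id and val are assumptions):

```r
library(data.table)

dt <- data.table(id  = c(1L, 1L, 2L, 3L, 3L),
                 val = c("a", "a", "b", "c", "d"))

dedup_all <- unique(dt)             # drop rows duplicated across all columns
dedup_id  <- unique(dt, by = "id")  # keep the first row for each id

# Row numbers that would be dropped when de-duplicating on id
dupe_rows <- which(duplicated(dt, by = "id"))
```

As with the key-based versions, these return new tables; nothing is deleted from dt by reference.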

It is also possible to add an incremental counter with '.I'. You can then search for duplicated keys or fields and remove them by removing the record with the counter. This is computationally expensive, but has some advantages since you can print the lines to be removed.

d_iris[,I:=.I,] # add a counter field

d_iris[,Key:=paste0(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species)]

for(i in d_iris[duplicated(Key),I]) {print(i)} # See lines with duplicated Key or Field

for(i in d_iris[duplicated(Key),I]) {d_iris <- d_iris[!I == i,]} # Remove lines with duplicated Key or any particular field.
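The row-by-row loop can be collapsed into one filter while still printing the rows that are about to go. A small sketch on a made-up table (the columns name and value are assumptions; I and Key follow the naming used above):

```r
library(data.table)

dt <- data.table(name = c("a", "b", "a", "c", "b"),
                 value = c(1, 2, 1, 3, 2))

dt[, I := .I]                      # add a counter field
dt[, Key := paste0(name, value)]   # build the key

to_remove <- dt[duplicated(Key), I]  # counters of the duplicated records
print(dt[I %in% to_remove])          # inspect the lines to be removed

dt <- dt[!I %in% to_remove]          # remove them all in one subset
```

This makes a single copy of the table instead of one copy per removed row.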

You can also just fill a row with 0s or NAs and then use an i query to delete them:

 X 
   x v foo
1: c 8   4
2: b 7   2

X[1] <- c(0)

X
   x v foo
1: 0 0   0
2: b 7   2

X[2] <- c(NA)
X
    x  v foo
1:  0  0   0
2: NA NA  NA

X <- X[x != 0,]
X <- X[!is.na(x),]