Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Choose one cell per row in data frame

I have a vector that tells me, for each row in a date frame, the column index for which the value in this row should be updated.

> set.seed(12008); n <- 10000; d <- data.frame(c1=1:n, c2=2*(1:n), c3=3*(1:n))
> i <- sample.int(3, n, replace=TRUE)
> head(d); head(i)
  c1 c2 c3
1  1  2  3
2  2  4  6
3  3  6  9
4  4  8 12
5  5 10 15
6  6 12 18
[1] 3 2 2 3 2 1

This means that for rows 1 and 4, c3 should be updated; for rows 2, 3 and 5, c2 should be updated (among others). What is the cleanest way to achieve this in R using vectorized operations, i.e, without apply and friends? EDIT: And, if at all possible, without R loops?

I have thought about transforming d into a matrix and then address the matrix elements using an one-dimensional vector. But then I haven't found a clean way to compute the one-dimensional address from the row and column indexes.

like image 225
krlmlr Avatar asked Jun 05 '12 09:06

krlmlr


2 Answers

With your example data, and using only the first few rows (D and I below) you can easily do what you want via a matrix as you surmise.

set.seed(12008)
n <- 10000
d <- data.frame(c1=1:n, c2=2*(1:n), c3=3*(1:n))
i <- sample.int(3, n, replace=TRUE)
## just work with small subset
D <- head(d)
I <- head(i)

First, convert D into a matrix:

dmat <- data.matrix(D)

Next compute the indices of the vector representation of the matrix corresponding to rows and columns indicated by I. For this, it is easy to generate the row indices as well as the column index (given by I) using seq_along(I) which in this simple example is the vector 1:6. To compute the vector indices we can use:

(I - 1) * nrow(D) + seq_along(I)

where the first part ( (I - 1) * nrow(D) ) gives us the correct multiple of the number of rows (6 here) to index the start of the Ith column. We then add on the row index to get the index for the n-th element in the Ith column.

Using this we just index into dmat using "[", treating it like a vector. The replacement version of "[" ("[<-") allows us to do the replacement in a single line. Here I replace the indicated elements with NA to make it easier to see that the correct elements were identified:

> dmat
  c1 c2 c3
1  1  2  3
2  2  4  6
3  3  6  9
4  4  8 12
5  5 10 15
6  6 12 18
> dmat[(I - 1) * nrow(D) + seq_along(I)] <- NA
> dmat
  c1 c2 c3
1  1  2 NA
2  2 NA  6
3  3 NA  9
4  4  8 NA
5  5 NA 15
6 NA 12 18
like image 51
Gavin Simpson Avatar answered Sep 29 '22 15:09

Gavin Simpson


If you are willing to first convert your data.frame to a matrix, you can index elements-to-be-replaced using a two-column matrix. (Beginning with R-2.16.0, this will be possible with data.frames directly.) The indexing matrix should have row indices in its first column and column indices in its second column.

Here's an example:

## Create a subset of the your data
set.seed(12008); n  <- 6 
D  <- data.frame(c1=1:n, c2=2*(1:n), c3=3*(1:n))
i <- seq_len(nrow(D))            # vector of row indices
j <- sample(3, n, replace=TRUE)  # vector of column indices 
ij <- cbind(i, j)                # a 2-column matrix to index a 2-D array 
                                 # (This extends smoothly to higher-D arrays.)  

## Convert it to a matrix    
Dmat <- as.matrix(D)

## Replace the elements indexed by 'ij'
Dmat[ij] <- NA
Dmat
#      c1 c2 c3
# [1,]  1  2 NA
# [2,]  2 NA  6
# [3,]  3 NA  9
# [4,]  4  8 NA
# [5,]  5 NA 15
# [6,] NA 12 18

Beginning with R-2.16.0, you will be able to use the same syntax for dataframes (i.e. without having to first convert dataframes to matrices).

From the R-devel NEWS file:

Matrix indexing of dataframes by two column numeric indices is now supported for replacement as well as extraction.

Using the current R-devel snapshot, here's what that looks like:

D[ij] <- NA
D
#   c1 c2 c3
# 1  1  2 NA
# 2  2 NA  6
# 3  3 NA  9
# 4  4  8 NA
# 5  5 NA 15
# 6 NA 12 18
like image 31
Josh O'Brien Avatar answered Sep 29 '22 16:09

Josh O'Brien