
R - convert BIG table into matrix by column names

This is an extension to an existing question: Convert table into matrix by column names

I am using the final answer: https://stackoverflow.com/a/2133898/1287275

The original CSV file has about 1.5M rows and three columns: row index, column index, and value. All numbers are long integers. The underlying matrix is sparse, about 220K x 220K in size, with an average of about 7 values per row.

The original read.table works just fine.

  x <- read.table("/users/wallace/Hadoop_Local/reference/DiscoveryData6Mo.csv", header=TRUE);

My problem comes when I do the reshape command.

  reshape(x, idvar="page_id", timevar="reco", direction="wide")

The CPU hits 100% and there it sits forever. The machine (a Mac) has more memory than R is using. I don't see why it should take so long to construct a sparse matrix.

I am using the default matrix package. I haven't installed anything extra. I just downloaded R a few days ago, so I should have the latest version.

Suggestions?

Thanks, Wallace

Wallace asked Mar 23 '12 01:03



1 Answer

I would use the sparseMatrix function from the Matrix package. The typical usage is sparseMatrix(i, j, x), where i, j, and x are three vectors of the same length holding, respectively, the row indices, column indices, and values of the non-zero elements of the matrix. Here is an example where I have tried to match the variable names and dimensions to your specifications:

num.pages <- 220000    # number of rows
num.recos <- 230000    # number of columns
N         <- 1500000   # number of (row, column, value) triples

df <- data.frame(page_id = sample.int(num.pages, N, replace=TRUE),
                 reco    = sample.int(num.recos, N, replace=TRUE),
                 value   = runif(N))
head(df)
#   page_id   reco     value
# 1   33688  48648 0.3141030
# 2   78750 188489 0.5591290
# 3  158870  13157 0.2249552
# 4   38492  56856 0.1664589
# 5   70338 138006 0.7575681
# 6  160827  68844 0.8375410

library("Matrix")
mat <- sparseMatrix(i = df$page_id,
                    j = df$reco,
                    x = df$value,
                    dims = c(num.pages, num.recos))
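
With the real data, the IDs may be arbitrary long integers rather than compact 1..n indices, in which case passing them directly as i and j would size the matrix by the largest ID instead of by the number of distinct IDs. A minimal sketch of one way to handle that is to map each ID to its position in the sorted list of distinct IDs and keep the original IDs as dimnames. The small data frame below is a made-up stand-in for the CSV, and the column name value is an assumption (page_id and reco are taken from the reshape call in the question):

```r
library("Matrix")

# Hypothetical stand-in for the data frame read from the CSV.
# `page_id` and `reco` come from the question; `value` is an assumed name.
x <- data.frame(page_id = c(900017L, 900017L, 900101L, 900233L),
                reco    = c(900101L, 900233L, 900017L, 900017L),
                value   = c(5, 3, 2, 7))

# Distinct IDs, sorted, define the row and column order of the matrix.
row_ids <- sort(unique(x$page_id))
col_ids <- sort(unique(x$reco))

# match() turns each raw ID into its 1-based position in the ID list.
mat <- sparseMatrix(i = match(x$page_id, row_ids),
                    j = match(x$reco, col_ids),
                    x = x$value,
                    dims = c(length(row_ids), length(col_ids)),
                    dimnames = list(as.character(row_ids),
                                    as.character(col_ids)))
```

Entries can then be looked up by the original IDs, for example mat["900017", "900101"]. Note that sparseMatrix sums the values of repeated (i, j) pairs by default, so duplicated rows in the input end up added together in a single cell.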
flodel answered Oct 19 '22 01:10