Direct update (replace) of sparse data frame is slow and inefficient

I'm attempting to read in a few hundred thousand JSON files and eventually get them into a dplyr object. But the JSON files are not a simple key-value parse and require a lot of pre-processing. The pre-processing is coded and runs fairly efficiently. The challenge I am having is loading each record into a single object (a data.table or dplyr object) efficiently.

This is very sparse data: I'll have over 2,000 variables that will mostly be missing, and each record will have maybe a hundred variables set. The variables are a mix of character, logical, and numeric, and I do know the mode of each variable.

I thought the best way to avoid R copying the object on every update (or adding one row at a time) would be to create an empty data frame and then update the specific fields after they are pulled from the JSON file. But doing this with a data frame is extremely slow; moving to a data.table or dplyr object is much better, but I'm still hoping to reduce the run time to minutes instead of hours. See my example below:

library(data.table)

timeMe <- function() {
  set.seed(1)
  names = paste0("A", seq(1:1200))

  # try with a data frame
  # outdf <- data.frame(matrix(NA, nrow=100, ncol=1200, dimnames=list(NULL, names)))
  # try with data table
  outdf <- data.table(matrix(NA, nrow=100, ncol=1200, dimnames=list(NULL, names)))

  for(i in seq(100)) {
    # generate 100 columns (real data is in json)
    sparse.cols <- sample(1200, 100)
    # Each record is coming in as a list
    # Each column is either a character, logical, or numeric
    sparse.val <- lapply(sparse.cols, function(i) {
      if(i < 401) {  # logical
        sample(c(TRUE, FALSE), 1) 
      } else if (i < 801) {  # numeric
        sample(seq(10), 1)
      } else { # character
        sample(LETTERS, 1)
      }
    })  # now we have a list with values to populate
    names(sparse.val) <- paste0("A", sparse.cols)

    # and here is the challenge and what takes a long time.
    # want to assign the ith row and the named column with each value
    for(x in names(sparse.val)) {
      val=sparse.val[[x]]
      # this is where the bottleneck is.
      # for data frame
      # outdf[i, x] <- val
      # for data table
      outdf[i, x:=val]
    }
  }  
  outdf
}

I thought the mode of each column might be getting set and reset with each update, but I have also tried pre-setting each column type and this didn't help.

For me, running this example with a data.frame (commented out above) takes around 22 seconds; with a data.table it is 5 seconds. I was hoping someone knew what was going on under the covers and could suggest a faster way to populate the data.table here.

asked Jun 05 '14 by jjacobs


1 Answer

I follow your code except for the part where you construct sparse.val. There are minor errors in the way you assign columns. And don't forget to check that the answer is right while trying to optimise :).

First, creating the data.table:

Since you say that you already know the type of each column, it's important to generate the correct type up front. Otherwise, when you do DT[, LHS := RHS] and the type of RHS is not equal to that of LHS, RHS will be coerced to the type of LHS. In your case, all your numeric and character values would be converted to logical, as all columns are of logical type. This is not what you want.
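
A minimal sketch of that coercion, on a toy table:

library(data.table)
DT <- data.table(matrix(NA, nrow = 2, ncol = 2))  # V1 and V2 are both logical
DT[1, V1 := 10]   # numeric RHS into a logical column (sub-assign with i)
class(DT$V1)      # historically stays "logical" and 10 is coerced (with a
                  # warning); recent data.table versions may promote the
                  # column instead -- either way, generate correct types up front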

Creating a matrix therefore won't help (all its columns will be of the same type), and it's also slow. Instead, I'd do it like this:

rows = 100L
cols = 1200L
outdf <- setDT(lapply(seq_len(cols), function(i) {
    if (i < 401L) rep(NA, rows)              # columns 1-400: logical
    else if (i < 801L) rep(NA_real_, rows)   # columns 401-800: numeric
    else rep(NA_character_, rows)            # columns 801-1200: character
}))
setnames(outdf, paste0("A", seq_len(cols)))

Now we've got the right types set. Make sure the boundaries here match the ones used when generating sparse.val: with the if/else chain above, columns 1-400 are logical, 401-800 are numeric, and 801-1200 are character, exactly the layout your sample data assumes. If the two ever disagree, you'll be writing values into columns of the wrong type and triggering the coercion described earlier.
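
A quick sanity check on the result we just built:

table(vapply(outdf, typeof, character(1)))
# character    double   logical
#       400       400       400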

Second, doing names(.) <-:

The line:

names(sparse.val) <- paste0("A", sparse.cols)

makes a copy and is not really necessary here, so we'll delete this line.
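
(If you did want the names without the copy, data.table's setattr() assigns attributes by reference; shown only for completeness, since set() below takes the column names directly via j:)

setattr(sparse.val, "names", paste0("A", sparse.cols))  # names set in place, no copy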

Third, the time-consuming for-loop:

for(x in names(sparse.val)) {
    val=sparse.val[[x]]
    outdf[i, x:=val]
}

is not actually doing what you think it's doing. It's not assigning val to the column whose name is stored in x. Instead, it's (over)writing, on each iteration, a column literally named x. Check your output.


This is not part of the optimisation; it's just to show what you actually want to do here:

for(x in names(sparse.val)) {
    val=sparse.val[[x]]
    outdf[i, (x) := val]
}

Note the parentheses around x. Now x is evaluated, and the value it contains names the column to which val is assigned. It's a bit subtle, I understand, but this is necessary because it keeps open the possibility of creating a column literally named x with DT[, x := val], when that is what you actually want.
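
A tiny illustration of the difference, on a toy table:

library(data.table)
dt  <- data.table(A1 = 1:2)
col <- "A1"
dt[, col := 99]     # creates a NEW column literally named "col"
dt[, (col) := 99]   # updates the existing column "A1"
names(dt)           # "A1" "col"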


Coming back to the optimisation, the good news is that your entire time-consuming for-loop reduces to a single call:

set(outdf, i=i, j=paste0("A", sparse.cols), value = sparse.val)

This is where data.table's sub-assign by reference feature comes in handy!
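
set() accepts a vector of column names in j and a parallel list in value, so one call fills every sparse column of row i in place. For example, on a toy table:

library(data.table)
dt <- data.table(a = rep(NA, 3), b = rep(NA_character_, 3))
set(dt, i = 2L, j = c("a", "b"), value = list(TRUE, "x"))  # one call, two columns
dt   # row 2 now holds TRUE and "x"; rows 1 and 3 are untouched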

Putting it all together:

Your final function looks like this:

timeMe2 <- function() {
    set.seed(1L)

    rows = 100L
    cols = 1200L
    # pre-allocate with the correct column types up front
    outdf <- as.data.table(lapply(seq_len(cols), function(i) {
        if (i < 401L) rep(NA, rows)              # logical
        else if (i < 801L) rep(NA_real_, rows)   # numeric
        else rep(NA_character_, rows)            # character
    }))
    setnames(outdf, paste0("A", seq_len(cols)))

    for (i in seq_len(rows)) {
        sparse.cols <- sample(1200L, 100L)
        sparse.val <- lapply(sparse.cols, function(i) {
            if (i < 401L) sample(c(TRUE, FALSE), 1L)
            else if (i < 801L) sample(seq(10), 1L)
            else sample(LETTERS, 1L)
        })
        # sub-assign all 100 columns of row i by reference, in one call
        set(outdf, i = i, j = paste0("A", sparse.cols), value = sparse.val)
    }
    outdf
}

With these changes, your solution takes 9.84 seconds on my system, whereas the function above takes 0.34 seconds, a ~29x improvement. I think this is the result you're looking for. Please verify it.
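
To verify on your machine (a quick check, not a rigorous benchmark):

library(data.table)
system.time(res1 <- timeMe())    # original version: ~9.8s on my system
system.time(res2 <- timeMe2())   # revised version:  ~0.34s on my system
table(vapply(res2, typeof, character(1)))  # 400 each of character/double/logical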

HTH

answered Nov 07 '22 by Arun