Fast replacing values in dataframe in R

Q: How do I replace specific values in a column in R?

The replace() function in R replaces the values of a vector x at the indices given in list with those given in values. It takes three arguments: the vector, the indices of the elements to replace, and the replacement values.
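As a minimal sketch of how base R's replace() behaves, the snippet below uses a logical mask as the index. The vector x is made-up illustration data, not taken from the question.

```r
# Made-up example vector containing some negative values
x <- c(3, -2, 0, -6, 4)

# replace(x, list, values): 'list' is an index (here a logical mask),
# 'values' holds the replacement(s)
y <- replace(x, x < 0, 0)

print(y)  # the copy with negatives set to 0
print(x)  # replace() returns a modified copy; x itself is unchanged
```

Because replace() returns a new vector, the result has to be assigned back if the original should change.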

Q: How do I change the value of dataset in R?

In the R Commander, you can click the Data set button to select a data set, and then click the Edit data set button. For more advanced data manipulation in R Commander, explore the Data menu, particularly the Data / Active data set and Data / Manage variables in active data set menus.

Q: How do I replace a column in a Dataframe in R?

To replace text in a character column of a data frame in R, you can use the str_replace() function from the stringr package.

I have a dataframe of 150,000 rows and 2,000 columns of values, some of which are negative. I am replacing those negative values with 0, but doing so is extremely slow (~60 min or more).

df[df < 0] = 0

where df[,1441:1453] looks like (all columns/values numeric):

  V1441 V1442 V1443 V1444 V1445 V1446 V1447 V1448 V1449 V1450 V1451 V1452 V1453
1     3     1     0     4     4    -2     0     3    12     5    17    34    27
2     0     1     0     7     0     0     0     1     0     0     0     0     0
3     0     2     0     1     2     3     6     1     2     1    -6     3     1
4     1     2     3     6     1     2     1    -6     3     1    -4     1     0
5     1     2     1    -6     3     1    -4     1     0     0     1     0     0
6     1     0     0     1     0     0     0     0     0     0     1     2     2

Is there a way to speed up this process? The way I am doing it is extremely slow. Is there a faster approach? Thanks.

asked Oct 11 '12 09:10

Benoit B.

2 Answers

Try transforming your df to a matrix.

df <- data.frame(a=rnorm(1000),b=rnorm(1000))
m <- as.matrix(df)
m[m<0] <- 0
df <- as.data.frame(m)

answered Oct 08 '22 17:10

Roland

Both your original approach and the current answer create an object the same size as m (or df) when creating m<0. The matrix approach is quicker because there is less internal copying with [<- than with [<-.data.frame.

You can use lapply and replace instead. Then you are only looking at one vector of length nrow(df) at a time and not copying as much.

df <- as.data.frame(lapply(df, function(x) replace(x, x < 0, 0)))

The above code should be quite efficient.
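As a quick sanity check, the sketch below applies the column-wise approach and the matrix approach to a tiny data frame and compares the results. The data are two columns loosely based on the sample in the question, and the names by_column and by_matrix are only illustrative. It makes no claim about timings on the full 150,000 x 2,000 data.

```r
# Small illustrative data frame (values loosely based on the question's sample)
df <- data.frame(V1446 = c(-2, 0, 3, 2, 1, 0),
                 V1451 = c(17, 0, -6, -4, 1, 1))

# Column-wise approach: process one vector of length nrow(df) at a time.
# Assigning into by_column[] keeps the data frame structure.
by_column <- df
by_column[] <- lapply(df, function(x) replace(x, x < 0, 0))

# Matrix approach: convert, replace with a logical mask, convert back
m <- as.matrix(df)
m[m < 0] <- 0
by_matrix <- as.data.frame(m)

# Both approaches should agree on this small example
print(all.equal(by_column, by_matrix))
```

To compare speed on your own data, you could wrap each approach in system.time().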

If you use data.table, then most of the memory and time inefficiency of the data.frame approach is removed. It would be ideal for a large data situation like yours.

library(data.table)
# this really shouldn't be 
DT <- lapply(df, function(x){replace(x, x <0,0)})
# change to data.table
setattr(DT, 'class', c('data.table','data.frame'))
# or 
# DT <- as.data.table(lapply(df, function(x) replace(x, x < 0, 0)))

You could also set keys on all the columns and then replace by reference the key values that are less than 0.
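The keyed approach is not spelled out in the answer. As a different way to replace by reference, here is a hedged sketch that loops over the columns with data.table's set(). It assumes the data.table package is installed, and the small table DT is made-up illustration data.

```r
library(data.table)

# Made-up example table with a few negative values
DT <- data.table(V1446 = c(-2, 0, 3),
                 V1451 = c(17, 0, -6))

# Loop over columns by position; set() modifies DT in place,
# so no copy of the whole table is made on each iteration
for (j in seq_along(DT)) {
  neg_rows <- which(DT[[j]] < 0)            # row indices of negative entries
  set(DT, i = neg_rows, j = j, value = 0)   # overwrite them by reference
}

print(DT)
```

An existing data frame can be converted in place with setDT(df) before running the loop, which avoids copying the data.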

answered Oct 08 '22 19:10

mnel
