I have a data frame of 150,000 rows and 2,000 columns of numeric values, some of them negative. I am replacing those negative values with 0, but doing so is extremely slow (~60 min or more).
df[df < 0] = 0
where df[,1441:1453] looks like this (all columns/values numeric):
  V1441 V1442 V1443 V1444 V1445 V1446 V1447 V1448 V1449 V1450 V1451 V1452 V1453
1     3     1     0     4     4    -2     0     3    12     5    17    34    27
2     0     1     0     7     0     0     0     1     0     0     0     0     0
3     0     2     0     1     2     3     6     1     2     1    -6     3     1
4     1     2     3     6     1     2     1    -6     3     1    -4     1     0
5     1     2     1    -6     3     1    -4     1     0     0     1     0     0
6     1     0     0     1     0     0     0     0     0     0     1     2     2
Is there a way to speed up this process? The way I am doing it is very slow. Is there a faster approach? Thanks.
The replace() function in R replaces the values of a vector x at the indices given in list with those given in values. It takes three parameters: the vector, the index (positions or a logical vector) of the elements to be replaced, and the replacement values.
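As a small illustration of how replace() behaves, here is a sketch on a made-up vector (the names x, y and z are only for this example). It shows both a logical index and a positional index, and that replace() returns a modified copy rather than changing its input.

```r
# Made-up example vector with some negative values
x <- c(3, -1, 0, -6, 4)

# Logical index: every element where x < 0 is TRUE gets replaced by 0
y <- replace(x, x < 0, 0)

# Positional index: replace the 2nd and 4th elements by 0
z <- replace(x, c(2, 4), 0)

# replace() returns a new vector; x itself still holds the negatives
print(y)
print(x)
```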
Try transforming your df to a matrix.
df <- data.frame(a=rnorm(1000),b=rnorm(1000))
m <- as.matrix(df)
m[m<0] <- 0
df <- as.data.frame(m)
Both your original approach and the current answer create an object the same size as m (or df) when creating m<0 (the matrix approach is quicker because there is less internal copying with [<- compared with [<-.data.frame). You can use lapply and replace instead; then you are only looking at a vector of length nrow(df) each time, and not copying so much.
df <- as.data.frame(lapply(df, function(x) replace(x, x < 0, 0)))
The above code should be quite efficient.
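To check that the approaches discussed so far really give the same result, here is a minimal sketch on a small made-up data frame (the data and the names a, b and cc are assumptions for illustration only). It assumes every column is numeric, as in the question; with mixed column types the matrix round trip would coerce everything to a common type.

```r
# Small all-numeric data frame with some negatives (made-up data)
set.seed(1)
df <- as.data.frame(matrix(rnorm(20), nrow = 5))

# 1. Original approach: logical indexing on the data frame
a <- df
a[a < 0] <- 0

# 2. Matrix round trip
m <- as.matrix(df)
m[m < 0] <- 0
b <- as.data.frame(m)

# 3. Column-wise replace() via lapply
cc <- as.data.frame(lapply(df, function(x) replace(x, x < 0, 0)))

# All three results should be equal
same <- isTRUE(all.equal(a, b)) && isTRUE(all.equal(a, cc))
print(same)
```

On the full 150,000 x 2,000 data, wrapping each variant in system.time() is a simple way to compare them; the actual timings will depend on the machine and R version.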
If you use data.table, then most of the memory and time inefficiency of the data.frame approach is removed. It would be ideal for a large-data situation like yours.
library(data.table)
# this really shouldn't be
DT <- lapply(df, function(x){replace(x, x <0,0)})
# convert the list to a data.table by reference (no copy)
setDT(DT)
# or, in one step:
# DT <- as.data.table(lapply(df, function(x) replace(x, x < 0, 0)))
You could also set keys on the columns and then replace by reference for key values less than 0.
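Replacing by reference can also be done without keys. The following is a minimal sketch, assuming the data.table package is installed, that loops over the columns and uses set() to overwrite only the negative entries in place. This is a different technique from the keyed approach mentioned above, and the toy table and its column names are made up for illustration.

```r
library(data.table)

# Toy data.table with some negatives (made-up data)
DT <- data.table(a = c(1, -2, 3), b = c(-1, 0, 5))

# Loop over columns by position; set() modifies DT in place,
# so no copy of the whole table is made
for (j in seq_along(DT)) {
  neg <- which(DT[[j]] < 0)          # row indices of negative values in column j
  set(DT, i = neg, j = j, value = 0) # overwrite just those cells
}

print(DT)
```

Because only one column's logical vector exists at a time, peak memory stays close to the size of the table itself.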