I have a big data set with around 500,000 rows, each of which is a string. I would like to trim every row to a fixed size.
I found this:
dt$rev <- strtrim(dt$rev, width=max_len)
However, it takes too long. Is there a faster way?
This has nothing to do with data.table. It's just that strtrim() is fairly slow.
As long as you're operating on single-width characters (that is, not on Chinese, Japanese, or Korean characters, for instance), you can instead use substr(), which is much faster.
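Whether a vector meets the single-width condition can be checked roughly before switching. This is a minimal sketch using base R only; `is_single_width` is a hypothetical helper name, not part of any package, and it compares total character counts with total display widths rather than inspecting each character.

```r
# Hypothetical helper: TRUE when every string has as many display columns
# as characters, which suggests substr() and strtrim() cut at the same place.
is_single_width <- function(x) {
  n_chars <- nchar(x, type = "chars")  # number of characters
  n_width <- nchar(x, type = "width")  # number of display columns
  all(n_chars == n_width, na.rm = TRUE)
}

is_single_width(state.name)
```

If the check fails, strtrim() remains the safer choice, because it trims by display width rather than by character count.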
## Make a long character vector with 5 million elements
x <- rep(state.name, 1e5)
## Speed comparison
system.time(substr(x, 1, 3))
# user system elapsed
# 0.43 0.00 0.44
system.time(strtrim(x, 3))
# user system elapsed
# 44.63 0.03 44.85
## Confirm that both methods return the same output
identical(substr(state.name,1,3), strtrim(state.name,3))
# [1] TRUE
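Applied to the column from the question, the change is a one-line swap. This is a minimal sketch, assuming `dt` has a character column `rev` as in the question; the sample rows and the `max_len` value are made up for illustration.

```r
# Made-up stand-in for the question's data
dt <- data.frame(
  rev = c("a fairly long review", "short", "another long one"),
  stringsAsFactors = FALSE
)
max_len <- 10

# Keep the first max_len characters of each string
dt$rev <- substr(dt$rev, 1, max_len)

# No string is longer than max_len afterwards
stopifnot(all(nchar(dt$rev) <= max_len))
```

Strings already shorter than `max_len` are returned unchanged.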