How to vectorize R strsplit?

Tags: r, vectorization, strsplit

When creating functions that use strsplit, vector inputs do not behave as desired, and sapply needs to be used. This is due to the list output that strsplit produces. Is there a way to vectorize the process - that is, the function produces the correct element in the list for each of the elements of the input?

For example, to count the lengths of words in a character vector:

words <- c("a","quick","brown","fox")

> length(strsplit(words,""))
[1] 4 # The number of words (length of the list)

> length(strsplit(words,"")[[1]])
[1] 1 # The length of the first word only

> sapply(words,function (x) length(strsplit(x,"")[[1]]))
a quick brown   fox 
1     5     5     3 
# Success, but potentially very slow

Ideally, I would like something like length(strsplit(words,"")[[.]]), where . is interpreted as the relevant element of the input vector.

asked Jun 16 '10 15:06

James

1 Answer

In general, you should try to use a vectorized function to begin with. Using strsplit will frequently require some kind of iteration afterwards (which will be slower), so try to avoid it if possible. In your example, you should use nchar instead:

> nchar(words)
[1] 1 5 5 3

More generally, take advantage of the fact that strsplit returns a list and use lapply:

> as.numeric(lapply(strsplit(words,""), length))
[1] 1 5 5 3
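As a minimal sketch of the same idea in base R, the list returned by strsplit can also be summarized with lengths() or vapply(). This assumes R 3.2.0 or later for lengths(); the variable names chars, counts_lengths and counts_vapply are illustrative and not part of the original answer.

```r
words <- c("a", "quick", "brown", "fox")

# Split each word into single characters; the result is a list
# with one element per input word
chars <- strsplit(words, "")

# lengths() returns the length of every list element in one call
counts_lengths <- lengths(chars)

# vapply() iterates like lapply() but checks that each result
# is a single integer, so the output is always an integer vector
counts_vapply <- vapply(chars, length, integer(1))

print(counts_lengths)
print(counts_vapply)
```

Both calls return a plain integer vector, so the as.numeric() conversion around lapply is not needed here.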

Or else use an l*ply family function from the plyr package (load it first with library(plyr)). For instance:

> laply(strsplit(words,""), length)
[1] 1 5 5 3

Edit:

In honor of Bloomsday, I decided to test the performance of these approaches using Joyce's Ulysses:

joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt")
joyce <- unlist(strsplit(joyce, " "))

Now that I have all the words, I can do the counts:

> # original version
> system.time(print(summary(sapply(joyce, function (x) length(strsplit(x,"")[[1]])))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
   2.65    0.03    2.73 
> # vectorized function
> system.time(print(summary(nchar(joyce))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
   0.05    0.00    0.04 
> # with lapply
> system.time(print(summary(as.numeric(lapply(strsplit(joyce,""), length)))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
    0.8     0.0     0.8 
> # with laply (from plyr)
> system.time(print(summary(laply(strsplit(joyce,""), length))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
  17.20    0.05   17.30
> # with ldply (from plyr)
> system.time(print(summary(ldply(strsplit(joyce,""), length))))
       V1        
 Min.   : 0.000  
 1st Qu.: 3.000  
 Median : 4.000  
 Mean   : 4.666  
 3rd Qu.: 6.000  
 Max.   :69.000  
   user  system elapsed 
   7.97    0.00    8.03

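The vectorized function (nchar, 0.04 s elapsed) and lapply (0.8 s) are considerably faster than the original sapply version (2.73 s), while the plyr versions are the slowest here (17.30 s for laply and 8.03 s for ldply). All solutions return the same answer (as seen by the summary output).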

Apparently the latest version of plyr is faster (this is using a slightly older version).
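The comparison can be rerun without downloading the text. The following is a rough sketch that uses a synthetic word vector as a stand-in for joyce; the vocabulary, the sample size of 20,000 and all variable names are assumptions made for illustration, and the timings will differ from those reported above depending on the machine and R version.

```r
# Hypothetical stand-in for the `joyce` vector: words sampled
# from a small made-up vocabulary
set.seed(1)
vocab <- c("a", "quick", "brown", "fox", "jumps", "over")
sample_words <- sample(vocab, 20000, replace = TRUE)

# Original approach: one strsplit call per word
t_sapply <- system.time(
  r_sapply <- sapply(sample_words,
                     function(x) length(strsplit(x, "")[[1]]),
                     USE.NAMES = FALSE)
)

# Vectorized function
t_nchar <- system.time(r_nchar <- nchar(sample_words))

# One strsplit call on the whole vector, then lapply over the list
t_lapply <- system.time(
  r_lapply <- as.numeric(lapply(strsplit(sample_words, ""), length))
)

# All three approaches should agree on every word
stopifnot(all(r_sapply == r_nchar), all(r_lapply == r_nchar))

# Elapsed seconds for each approach
print(c(sapply = t_sapply[["elapsed"]],
        nchar  = t_nchar[["elapsed"]],
        lapply = t_lapply[["elapsed"]]))
```

No expected timings are shown because they are machine dependent; the stopifnot() line only checks that the three approaches return the same counts.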

answered Sep 21 '22 21:09

Shane
