R: Fast string split on first delimiter occurrence

I have a file with ~40 million rows, and I need to split each row at the first comma.

The following, which uses the stringr function str_split_fixed, works well but is very slow.

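library(stringr)

# 40,000 rows: id (1:1000) is recycled to match the 40,000 letters
df1 <- data.frame(id = 1:1000,
                  letter1 = rep(letters[sample(1:25, 1000, replace = TRUE)], 40))
# "id,letter", then the same string doubled so each row holds three commas
df1$combCol1 <- paste(df1$id, ',', df1$letter1, sep = '')
df1$combCol2 <- paste(df1$combCol1, ',', df1$combCol1, sep = '')

# Split at the first comma only: at most 2 pieces per row
st1 <- str_split_fixed(df1$combCol2, ',', 2)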

Any suggestions for a faster way to do this?

Update

The stri_split_fixed function in more recent versions of "stringi" has a simplify argument that can be set to TRUE to return a matrix. Thus, the updated solution would be:

stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE)
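
As a minimal sketch of how the result might be used, the two matrix columns can be assigned back to a data frame. The small df below is a stand-in for df1, and the column names first and rest are placeholders chosen for this illustration, not names from the original post.

```r
library(stringi)

# Small stand-in for df1; combCol2 has the same "id,letter,id,letter" shape
df <- data.frame(combCol2 = c("1,a,1,a", "2,b,2,b"), stringsAsFactors = FALSE)

# Split each string at the first comma only (n = 2 pieces per string)
parts <- stri_split_fixed(df$combCol2, ",", 2, simplify = TRUE)

# Column names are illustrative assumptions
df$first <- parts[, 1]
df$rest  <- parts[, 2]
```

Everything after the first comma stays together in the second column, including any later commas.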

Original answer (with updated benchmarks)

If you are comfortable with the "stringr" syntax and don't want to veer too far from it, but you also want to benefit from a speed boost, try the "stringi" package instead:

library(stringr)
library(stringi)
system.time(temp1 <- str_split_fixed(df1$combCol2, ',', 2))
#    user  system elapsed 
#    3.25    0.00    3.25 
system.time(temp2a <- do.call(rbind, stri_split_fixed(df1$combCol2, ",", 2)))
#    user  system elapsed 
#    0.04    0.00    0.05 
system.time(temp2b <- stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE))
#    user  system elapsed 
#    0.01    0.00    0.01

Most of the "stringr" functions have "stringi" parallels, but as this example shows, stri_split_fixed returns a list by default, so without simplify = TRUE one extra step (do.call(rbind, ...)) is needed to bind the pieces into a matrix.
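
To make the list-versus-matrix difference concrete, here is a small sketch on a made-up vector (x is a hypothetical input, not part of the original data). It shows the default list output and the two routes to a matrix used in the timings above.

```r
library(stringi)

# Hypothetical sample input; each string contains more than one comma
x <- c("1,a,1,a", "2,b,2,b", "3,c,3,c")

# Default: a list with one character vector per input string
as_list <- stri_split_fixed(x, ",", 2)

# Route 1: bind the list elements row-wise into a matrix
m_rbind <- do.call(rbind, as_list)

# Route 2: ask stringi for the matrix directly
m_simplify <- stri_split_fixed(x, ",", 2, simplify = TRUE)

# Check that both routes hold the same values
all(m_rbind == m_simplify)
```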

Here's how it compares with @RichardScriven's suggestion in the comments:

fun1a <- function() do.call(rbind, stri_split_fixed(df1$combCol2, ",", 2))
fun1b <- function() stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE)
fun2 <- function() {
  do.call(rbind, regmatches(df1$combCol2, regexpr(",", df1$combCol2), 
                            invert = TRUE))
} 

library(microbenchmark)
microbenchmark(fun1a(), fun1b(), fun2(), times = 10)
# Unit: milliseconds
#     expr       min        lq      mean    median        uq       max neval
#  fun1a()  42.72647  46.35848  59.56948  51.94796  69.29920  98.46330    10
#  fun1b()  17.55183  18.59337  20.09049  18.84907  22.09419  26.85343    10
#   fun2() 370.82055 404.23115 434.62582 439.54923 476.02889 480.97912    10
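
For readers who cannot install "stringi", one base R possibility is to locate the first comma with regexpr(..., fixed = TRUE) and cut the string with substr. This is a different technique from the ones benchmarked above and it is not timed here; split_first is a hypothetical helper name, and the handling of strings with no delimiter is an assumption about the desired behaviour.

```r
# Hypothetical helper: split each string at the first occurrence of the
# literal string `sep`; returns a two-column character matrix.
split_first <- function(x, sep = ",") {
  # Position of the first match, or -1 where `sep` does not occur
  pos <- as.integer(regexpr(sep, x, fixed = TRUE))
  found <- pos > 0
  # No match: keep the whole string on the left, empty string on the right
  left  <- ifelse(found, substr(x, 1, pos - 1), x)
  right <- ifelse(found, substr(x, pos + nchar(sep), nchar(x)), "")
  cbind(left, right, deparse.level = 0)
}

res <- split_first(c("1,a,1,a", "no comma here"))
```

Whether this is competitive on 40 million rows would need to be measured on the real data.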

Tags: string, regex, split, r

Asked by screechOwl; answered by A5C1D2H2I1M1N2O1R2T1
