Delete characters at positions within a string in R?

I am looking for a way to delete the characters at certain positions within a string in R. For example, given the string "1,2,1,1,2,1,1,1,1,2,1,1", I want to delete the characters at positions 3, 4, 7 and 8. The operation would turn the string into "1,1,2,1,1,1,1,2,1,1".

Unfortunately, breaking the string into a list using strsplit is not an option, because the strings I am working with are over 1 million characters long. Considering I have about 2,500 strings, it works out to be quite some time.

Alternatively, finding a way to replace the characters with an empty string "" would achieve the same purpose - I think. Looking into this line of thought, I came across this StackOverflow post:

R: How can I replace let's say the 5th element within a string?

Unfortunately, the suggested solution is hard to generalize efficiently: the following takes about 60 seconds per input string when removing a list of 2,000 positions:

subchar2 = function(inputstring, pos){
    string = ""
    memory = 0
    # copy the stretch between consecutive removed positions
    for(num in pos){
        string = paste(string, substr(inputstring, (memory+1), (num-1)), sep = "")
        memory = num
    }
    # append whatever follows the last removed position
    string = paste(string, substr(inputstring, (memory+1), nchar(inputstring)), sep = "")
    return(string)
}

Looking into the problem, I found a snippet of code that seems to replace the characters at certain positions with "-":

subchar <- function(string, pos) {
        for(i in pos) {
            string <- gsub(paste("^(.{", i-1, "}).", sep=""), "\\1-", string)
        }
        return(string)
}

I don't quite understand regular expressions (yet), but I have a strong suspicion that something along these lines will be much faster than the first solution. Unfortunately, this subchar function seems to break when the values in pos get high:

> test = subchar(data[1], 257)
Error in gsub(paste("^(.{", i - 1, "}).", sep = ""), "\\1-", string) :
invalid regular expression '^(.{256}).', reason 'Invalid contents of {}'

I was also considering reading the string data into a table using SQL, but I was hoping there would be an elegant string solution. The SQL implementation in R to do this seems rather complicated.

Any ideas? Thanks!

Gordon Freeman asked Aug 21 '12 00:08


2 Answers

strsplit is more than ten times faster if you use fixed = TRUE. Extrapolating roughly from the timings below, it would take a little over 2 minutes to process your 2,500 strings of 1,000,000 comma-separated integers.

N <- 1000000
x <- sample(0:1, N, replace = TRUE)
s <- paste(x, collapse = ",")

# this is a vector of 10 strings
M <- 10
S <- rep(s, M)

system.time(y <- strsplit(S, split = ","))
# user  system elapsed 
# 6.57    0.00    6.56 
system.time(y <- strsplit(S, split = ",", fixed = TRUE))
# user  system elapsed 
# 0.46    0.03    0.50

This is almost 3 times faster than using scan:

system.time(scan(textConnection(S), sep=",", what="a"))
# Read 10000000 items
# user  system elapsed 
# 1.21    0.09    1.42
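
Once a string is split, dropping positions is just negative indexing followed by paste. The following is a minimal sketch, not part of the timings above; `drop_elements` and `drop_chars` are hypothetical helper names. The first assumes the positions count comma-separated elements; the second assumes they count individual characters, as in the question's example, and splits on "" to get one character per element.

```r
# Hypothetical helper: remove the comma-separated elements at `pos`.
drop_elements <- function(s, pos) {
  if (length(pos) == 0) return(s)  # x[-integer(0)] would drop everything
  parts <- strsplit(s, split = ",", fixed = TRUE)[[1]]
  paste(parts[-pos], collapse = ",")
}

# Hypothetical helper: remove the single characters at `pos`.
drop_chars <- function(s, pos) {
  if (length(pos) == 0) return(s)
  chars <- strsplit(s, split = "", fixed = TRUE)[[1]]  # one character each
  paste(chars[-pos], collapse = "")
}

s <- "1,2,1,1,2,1,1,1,1,2,1,1"
drop_elements(s, c(3, 4, 7, 8))
# [1] "1,2,2,1,1,2,1,1"
drop_chars(s, c(3, 4, 7, 8))
# [1] "1,1,2,1,1,1,1,2,1,1"
```

Both helpers handle one string at a time; for a character vector, something like vapply(S, drop_chars, character(1), pos = pos) should apply the same removal to each string.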
flodel answered Sep 20 '22 14:09


Read them in using scan(), with sep="," and what="a". You can scan one "line" at a time with nlines=1, and if the input is a textConnection, the connection will "remember" where the last read stopped.

x <- paste(sample(0:1, 1000, replace=TRUE), collapse=",")
xin <- textConnection(x)

x995 <- scan(xin, sep=",", what="a", nmax=995)
# Read 995 items
x5 <- scan(xin, sep=",", what="a", nmax=995)
# Read 5 items

Here's an illustration with 5 "lines"

> x <- paste( rep( paste(sample(0:1, 50, rep=T), collapse=","),  5),  collapse="\n")
> str(x)
 chr "1,0,0,0,0,1,0,0,1,1,1,0,1,1,0,0,0,1,1,1,1,0,0,1,0,1,0,1,0,0,1,0,0,0,1,0,1,0,0,1,1,1,1,1,0,0,0,1,0,0\n1,0,0,0,0,1,0,0,1,1,1,0,1,"| __truncated__
> xin <- textConnection(x)
> x1 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x2 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x3 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x4 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x5 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x6 <- scan(xin, sep=",", what="a", nlines=1)
Read 0 items
> length(x1)
[1] 50
> length(x1[-c(3,4,7,8)])
[1] 46
> paste(x1, collapse=",")
[1] "1,0,0,0,0,1,0,0,1,1,1,0,1,1,0,0,0,1,1,1,1,0,0,1,0,1,0,1,0,0,1,0,0,0,1,0,1,0,0,1,1,1,1,1,0,0,0,1,0,0"
> 
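
The transcript above stops short of rebuilding the strings. A minimal sketch of the full loop, assuming the positions count comma-separated elements and that every input line is non-empty; `drop_per_line` is a hypothetical helper name, and quiet=TRUE is only there to silence the "Read n items" messages:

```r
# Hypothetical helper: read each line with scan(), drop the elements at
# `pos`, and paste the remainder back into a comma-separated string.
drop_per_line <- function(x, pos) {
  con <- textConnection(x)
  on.exit(close(con))
  out <- character(0)
  repeat {
    items <- scan(con, sep = ",", what = "a", nlines = 1, quiet = TRUE)
    if (length(items) == 0) break            # nothing left to read
    if (length(pos) > 0) items <- items[-pos]
    out <- c(out, paste(items, collapse = ","))
  }
  out
}

strings <- rep("1,2,1,1,2,1,1,1,1,2,1,1", 3)
drop_per_line(strings, c(3, 4, 7, 8))
# [1] "1,2,2,1,1,2,1,1" "1,2,2,1,1,2,1,1" "1,2,2,1,1,2,1,1"
```

Growing `out` with c() is fine for a few thousand lines; for much larger inputs, preallocating the result vector would likely be kinder to memory.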
IRTFM answered Sep 21 '22 14:09