R: Removing Whitespace + Delimiter

Question

I'm fairly new to the R language. So I have this vector containing the following:

> head(sampleVector)

[1] "| txt01 |   100 |         200 |       123.456 |           0.12345 |"
[2] "| txt02 |   300 |         400 |       789.012 |           0.06789 |"

I want to break each line into separate pieces, with one data value per piece. I want to get a list resultList that would eventually print out the following:

> head(resultList)

[[1]]
[1] ""   "txt01"    "100"       "200"     "123.456"        "0.12345" 

[[2]]
[1] ""   "txt02"    "300"       "400"     "789.012"        "0.06789"

I am struggling with the strsplit() notation. This is the code I have so far:

resultList <- strsplit(sampleVector, "\\s+[|] | [|]\\s+ | [\\s+]")
# would give me the following output

# [[1]]
# [1] "| txt01"    "100"       "200"     "123.456"        "0.12345 |"

Is there any way I can get this output with one strsplit call? I am guessing my notation for matching the delimiter plus whitespace is wrong. Any help on this would be appreciated.

thelatemail · Accepted Answer

Another strsplit option which I nearly missed:

# test is assumed to hold the same strings as sampleVector in the question
strsplit(test,"[| ]+")
#[[1]]
#[1] ""        "txt01"   "100"     "200"     "123.456" "0.12345"
# 
#[[2]]
#[1] ""        "txt02"   "300"     "400"     "789.012" "0.06789"

...and my original answer because regmatches is my favourite function of late:

regmatches(test,gregexpr("[^| ]+",test))
#[[1]]
#[1] "txt01"   "100"     "200"     "123.456" "0.12345"
#
#[[2]]
#[1] "txt02"   "300"     "400"     "789.012" "0.06789"

To break it down as requested:

[| ]+ is a regex that matches one or more (+) consecutive characters that are each a space or a pipe (|).
[^| ]+ is a regex that matches one or more (+) consecutive characters that are each not (^) a space or a pipe (|).
gregexpr finds every match of this pattern and returns the start position and length of each match.
regmatches extracts from test all the substrings that gregexpr matched.
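To make the gregexpr/regmatches relationship above concrete, here is a minimal sketch. It assumes test holds the two strings from the question, and the names m, starts, lens, byHand and same are only illustrative. It rebuilds the matches by hand from the positions and lengths that gregexpr reports, then compares them with what regmatches returns.

```r
# Sample input copied from the question; `test` is the name used in the answer.
test <- c(
  "| txt01 |   100 |         200 |       123.456 |           0.12345 |",
  "| txt02 |   300 |         400 |       789.012 |           0.06789 |"
)

# gregexpr returns a list with one element per input string. Each element is
# an integer vector of match start positions carrying a "match.length" attribute.
m <- gregexpr("[^| ]+", test)
starts <- as.integer(m[[1]])
lens   <- attr(m[[1]], "match.length")

# Cutting the first string at those positions reproduces the matches by hand.
byHand <- substring(test[1], starts, starts + lens - 1)

# regmatches does the same extraction for every string at once.
same <- identical(byHand, regmatches(test, m)[[1]])
print(byHand)
print(same)
```

Because regmatches only returns what was matched, there is no leading empty string here, unlike the strsplit result, where the text before the first delimiter (nothing, in this case) is kept as a piece.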

Rich Scriven · Answer

Here's one way. This first removes the | from the vector with gsub. Then it uses strsplit on the spaces (or any number of spaces). Probably a bit easier that way.

strsplit(gsub("|", "", sampleVector, fixed=TRUE), "\\s+")
# [[1]]
# [1] ""        "txt01"   "100"     "200"     "123.456" "0.12345"
#
# [[2]]
# [1] ""        "txt02"   "300"     "400"     "789.012" "0.06789"
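The one-liner above can also be read as two separate steps. The sketch below assumes sampleVector holds the two strings from the question. The names noPipes, pieces and piecesTrimmed are illustrative, and the trimws variation is an optional extra that is not part of the original answer.

```r
# Sample input copied from the question.
sampleVector <- c(
  "| txt01 |   100 |         200 |       123.456 |           0.12345 |",
  "| txt02 |   300 |         400 |       789.012 |           0.06789 |"
)

# Step 1: delete every pipe. fixed = TRUE makes "|" a literal character
# instead of the regex alternation operator.
noPipes <- gsub("|", "", sampleVector, fixed = TRUE)

# Step 2: split on runs of whitespace. Each string still starts with a space,
# so the first piece is an empty string.
pieces <- strsplit(noPipes, "\\s+")

# Optional variation (an assumption about what may be wanted): trim the ends
# first so that no leading empty string is produced.
piecesTrimmed <- strsplit(trimws(noPipes), "\\s+")

print(pieces[[1]])
print(piecesTrimmed[[1]])
```

The leading "" matches the output requested in the question; the trimmed version is only useful if that empty piece is not wanted.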

Here's an interesting alternative using scan that might be useful, and will probably be quite fast.

lapply(sampleVector, function(y) {
    s <- scan(text = y, what = character(), sep = "|", quiet = TRUE)
    (g <- gsub("\\s+", "", s))[-length(g)]
})
# [[1]]
# [1] ""        "txt01"   "100"     "200"     "123.456" "0.12345"
#
# [[2]]
# [1] ""        "txt02"   "300"     "400"     "789.012" "0.06789"
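To see what each step of the scan version contributes, the intermediate values can be printed for a single string. This is only a sketch: y is the first string from the question, and s and g follow the names used in the function above.

```r
# First string from the question.
y <- "| txt01 |   100 |         200 |       123.456 |           0.12345 |"

# scan splits the text on "|" and, with what = character(), returns the
# fields as strings; the padding spaces around each value are still present.
s <- scan(text = y, what = character(), sep = "|", quiet = TRUE)
print(s)

# gsub removes all whitespace inside each field, leaving only the values.
g <- gsub("\\s+", "", s)
print(g)
```

In the function above, the final [-length(g)] then drops the last element of g, which corresponds to whatever follows the closing pipe.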
