
R: Removing Whitespace + Delimiter

I'm fairly new to the R language. So I have this vector containing the following:

> head(sampleVector)

[1] "| txt01 |   100 |         200 |       123.456 |           0.12345 |"
[2] "| txt02 |   300 |         400 |       789.012 |           0.06789 |"

I want to extract the lines and break each into separate pieces, with one data value per piece. I want to get a list resultList that would eventually print out the following:

> head(resultList)

[[1]]
[1] ""   "txt01"    "100"       "200"     "123.456"        "0.12345"

[[2]]
[1] ""   "txt02"    "300"       "400"     "789.012"        "0.06789"

I am struggling with the strsplit() notation and I have tried and got the following code so far:

resultList <- strsplit(sampleVector, "\\s+[|] | [|]\\s+ | [\\s+]")
# would give me the following output

# [[1]]
# [1] "| txt01"    "100"       "200"     "123.456"        "0.12345 |"

Is there any way I can get this output with a single strsplit call? I am guessing my notation for matching the delimiter plus the surrounding whitespace is wrong. Any help on this would be appreciated.

12341234 asked Feb 11 '23 18:02


2 Answers

Another strsplit option which I nearly missed:

strsplit(test,"[| ]+")
#[[1]]
#[1] ""        "txt01"   "100"     "200"     "123.456" "0.12345"
# 
#[[2]]
#[1] ""        "txt02"   "300"     "400"     "789.012" "0.06789"

...and my original answer because regmatches is my favourite function of late:

regmatches(test,gregexpr("[^| ]+",test))
#[[1]]
#[1] "txt01"   "100"     "200"     "123.456" "0.12345"
#
#[[2]]
#[1] "txt02"   "300"     "400"     "789.012" "0.06789"

To break it down as requested:

[| ]+ is a regex matching one or more (+) consecutive spaces or pipes (|).
[^| ]+ is a regex matching one or more (+) consecutive characters that are not (^) a space or a pipe (|).
gregexpr finds all matches of this pattern and returns their start positions and lengths.
regmatches extracts from test all the substrings matched by gregexpr.
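
As a rough illustration of the breakdown above, the sketch below runs both patterns on a single shortened string. The variable x and its contents are made up for this example and only mimic the shape of the question's data.

```r
# Hypothetical one-element input shaped like the question's data
x <- "| txt01 |   100 |"

# gregexpr returns one element per input string: the match start positions,
# with the match lengths stored in the "match.length" attribute
m      <- gregexpr("[^| ]+", x)
starts <- as.integer(m[[1]])
lens   <- as.integer(attr(m[[1]], "match.length"))

# regmatches uses those positions to pull out the matched substrings,
# which is the same as slicing them out by hand with substring()
pieces  <- regmatches(x, m)[[1]]
by_hand <- substring(x, starts, starts + lens - 1)

# strsplit with the complementary pattern splits on the delimiters instead;
# a delimiter match at the very start of the string produces a leading ""
split_pieces <- strsplit(x, "[| ]+")[[1]]
```

That leading "" is why the strsplit result has one more element than the regmatches result. A delimiter match at the end of the string does not add a trailing "".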

thelatemail answered Feb 24 '23 14:02


Here's one way. This first removes the | characters from the vector with gsub, then uses strsplit to split on runs of one or more whitespace characters. Probably a bit easier that way.

strsplit(gsub("|", "", sampleVector, fixed=TRUE), "\\s+")
# [[1]]
# [1] ""        "txt01"   "100"     "200"     "123.456" "0.12345"
#
# [[2]]
# [1] ""        "txt02"   "300"     "400"     "789.012" "0.06789"
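
The fixed=TRUE argument matters because | is the alternation operator in a regular expression, so it has to be taken literally (or escaped) before gsub can remove it. A small sketch of the two steps, using a made-up string x in place of the question's data:

```r
# Hypothetical one-element input shaped like the question's data
x <- "| txt01 |   100 |"

# Step 1: drop the pipes. fixed = TRUE treats "|" as a plain character;
# escaping it as "\\|" in a regular expression gives the same result
no_pipes_fixed   <- gsub("|", "", x, fixed = TRUE)
no_pipes_escaped <- gsub("\\|", "", x)

# Step 2: split what is left on runs of whitespace. The string still starts
# with a space, so the first piece is the empty string ""
pieces <- strsplit(no_pipes_fixed, "\\s+")[[1]]
```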

Here's an interesting alternative using scan that might be useful, and will probably be quite fast.

lapply(sampleVector, function(y) {
    s <- scan(text = y, what = character(), sep = "|", quiet = TRUE)
    (g <- gsub("\\s+", "", s))[-length(g)]
})
# [[1]]
# [1] ""        "txt01"   "100"     "200"     "123.456" "0.12345"
#
# [[2]]
# [1] ""        "txt02"   "300"     "400"     "789.012" "0.06789"
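
To see why the last element gets dropped in the scan version, it may help to look at the intermediate values. This is only a sketch on a made-up string x, with the steps of the function above pulled apart:

```r
# Hypothetical one-element input shaped like the question's data
x <- "| txt01 |   100 |"

# Splitting on "|" yields one field per gap between pipes, plus an empty
# field before the first pipe and another after the last one
raw <- scan(text = x, what = character(), sep = "|", quiet = TRUE)

# Remove the padding whitespace inside each field
stripped <- gsub("\\s+", "", raw)

# Drop the trailing empty field, keeping the leading "" the question asked for
cleaned <- stripped[-length(stripped)]
```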

Rich Scriven answered Feb 24 '23 13:02