I have a vector
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")
which generally follows the pattern of alternating between two different sequences of strings (the first sequence being all alphabetical, the second being numerical with the symbol #).
However, there are cases where no # term appears: in the above, between mp and jq, and then again after ez. I would like to define a function which "fills the gaps" with the character string #, so that I would have the output:
[1] "ab" "#4" "gw" "#29" "mp" "#" "jq" "#35" "ez" "#"
which I would then convert to a data frame
V1 V2
1 ab #4
2 gw #29
3 mp #
4 jq #35
5 ez #
My attempt so far is rather clunky and relies on looping through the vector and filling the gaps. I'd be interested to see more elegant solutions.
My Solution
greplSpace <- function(pattern, replacement, x){
  j <- 1
  while( j < length(x) ){
    if(grepl(pattern, x[j+1]) ){
      j <- j+2
    } else {
      x <- c( x[1:j], replacement, x[(j+1):length(x)] )
      j <- j+2
    }
  }
  if( ! grepl(pattern, tail(x,1) ) ){ x <- c(x, replacement) }
  return(x)
}
library(magrittr)
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")
vec %>% greplSpace("#", "#", . ) %>%
  matrix(ncol = 2, byrow = TRUE) %>%
  as.data.frame
Starting with your vec, we can create the expected data frame directly with functions from the dplyr, tidyr, and stringr packages.
library(dplyr)
library(tidyr)
library(stringr)
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")
# tibble() replaces the deprecated data_frame()
dat <- tibble(Value = vec)
dat2 <- dat %>%
  mutate(String = !str_detect(vec, "#"),
         Key = ifelse(String, "V1", "V2"),
         Row = cumsum(String)) %>%
  select(-String) %>%
  spread(Key, Value, fill = "#") %>%
  select(-Row)
dat2
# # A tibble: 5 x 2
# V1 V2
# <chr> <chr>
# 1 ab #4
# 2 gw #29
# 3 mp #
# 4 jq #35
# 5 ez #
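In current tidyr releases spread() is superseded by pivot_wider(). The following is a rough sketch of the same idea with that function, not a replacement for the answer above. It assumes tidyr 1.1.0 or later (where values_fill accepts a single value); the column name is_label is made up for illustration.

```r
library(dplyr)
library(tidyr)

vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")

dat2 <- tibble(Value = vec) %>%
  mutate(is_label = !startsWith(Value, "#"),     # TRUE for alphabetical terms
         Key = if_else(is_label, "V1", "V2"),    # target column
         Row = cumsum(is_label)) %>%             # a new row starts at each label
  select(-is_label) %>%
  # cells with no value after pivoting are filled with "#"
  pivot_wider(names_from = Key, values_from = Value, values_fill = "#") %>%
  select(-Row)

dat2
```

The grouping logic is unchanged; only the reshaping call differs.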
Here is a base R option with split. Create a logical index by checking for "#" in each of the strings, negate it, take the cumulative sum, and split the original vector by this grouping variable into a list ('lst'). List elements that have fewer than two elements (the maximum length) are padded with NA at the end by assignment with length<-. Then rbind the list elements into a two-column matrix. If needed, convert those NA to #:
lst <- split(vec, cumsum(!grepl("#", vec)))
out <- do.call(rbind, lapply(lst, `length<-`, max(lengths(lst))))
out[,2][is.na(out[,2])] <- "#" #not recommended though
out
# [,1] [,2]
#1 "ab" "#4"
#2 "gw" "#29"
#3 "mp" "#"
#4 "jq" "#35"
#5 "ez" "#"
Wrap it with as.data.frame if we need a data.frame output.
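To make the split-based steps easier to follow, here is a small sketch that prints the intermediate grouping variable and keeps the gaps as NA. The names is_label, grp and padded are chosen here for illustration only.

```r
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")

is_label <- !grepl("#", vec, fixed = TRUE)  # TRUE for the alphabetical terms
grp <- cumsum(is_label)                     # group id increases at each label
grp
# [1] 1 1 2 2 3 4 4 5

lst <- split(vec, grp)
# pad every group to length 2; short groups get a trailing NA
padded <- lapply(lst, `length<-`, 2)
out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
out
```

Leaving the gaps as NA rather than "#" keeps missing values distinguishable from real data, which is likely why the replacement step above is marked as not recommended.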
You can use base R: first collapse the vector into a string while inserting # where needed, then just read it using read.csv.
vec1=gsub("([a-z]),\\s*([a-z])|$","\\1,#,\\2",toString(vec))
read.csv(text=gsub("(#.*?),","\\1\n",vec1),h=F)
V1 V2
1 ab #4
2 gw #29
3 mp #
4 jq #35
5 ez #
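Explanation:
Collapse the vector into a single string using toString.
Wherever an alphabetical term is directly followed by another alphabetical term, ie [a-z],\s*[a-z], or at the end of the string, ie |$, you insert a #.
Replace the comma after each # term with a newline, and read in the data as a table.
You can also do: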
a=read.csv(h=F,text=toString(sub("([a-z]+)","\n\\1",vec)),na=c(" ",""))[1:2]
a
V1 V2
1 ab #4
2 gw #29
3 mp <NA>
4 jq #35
5 ez <NA>
data.frame(replace(as.matrix(a),is.na(a),"#"))
V1 V2
1 ab #4
2 gw #29
3 mp #
4 jq #35
5 ez #
Another base possibility:
do.call(rbind, tapply(vec, cumsum(!grepl("^#", vec)), FUN = function(x){
if(length(x) == 1) c(x, "#") else x}))
# [,1] [,2]
# 1 "ab" "#4"
# 2 "gw" "#29"
# 3 "mp" "#"
# 4 "jq" "#35"
# 5 "ez" "#"
Explanation:
1. Check if the elements in vec start with #, and negate the result: !grepl("^#", vec) creates a logical vector.
2. Create a grouping variable by applying cumsum to the logical vector (note: 1 & 2 are similar to @akrun's answer).
3. Use tapply to apply a function to each subset of vec, defined by the grouping variable. Check if the length is 1. If so, pad with a trailing #, else just return the subset: if(length(x) == 1) c(x, "#") else x
4. Bind the resulting list together by row: do.call(rbind, ...)
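The steps above could be wrapped into a small reusable function. This is only a sketch: the name fill_gaps and its marker argument are hypothetical, and it assumes the vector starts with an alphabetical term and never has two marker terms in a row.

```r
# Hypothetical helper: pair each label with its "#..." term, padding gaps
fill_gaps <- function(x, marker = "#") {
  grp <- cumsum(!startsWith(x, marker))   # new group at every label
  rows <- tapply(x, grp, function(g) {
    if (length(g) == 1) c(g, marker) else g
  })
  as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
}

vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")
res <- fill_gaps(vec)
res
```

startsWith is used instead of grepl("^#", ...) so that the marker is matched literally rather than as a regular expression.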
Another one:
# create a row index
ri <- cumsum(!grepl("^#", vec))
# create a column index
ci <- ave(ri, ri, FUN = seq_along)
# create an empty matrix of desired dimensions
m <- matrix(nrow = max(ri), ncol = 2)
# assign 'vec' to matrix at relevant indices
m[cbind(ri, ci)] <- vec
# replace NA with '#'
m[is.na(m)] <- "#"
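The key step in this variant is indexing a matrix with a two-column matrix of (row, column) pairs. A short sketch that prints the two index vectors may help; the name idx is introduced here for illustration.

```r
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")

ri <- cumsum(!grepl("^#", vec))     # row of each element
ci <- ave(ri, ri, FUN = seq_along)  # position of each element within its row
ri
# [1] 1 1 2 2 3 4 4 5
ci
# [1] 1 2 1 2 1 1 2 1

idx <- cbind(ri, ci)                # one (row, column) pair per element of vec
m <- matrix(NA_character_, nrow = max(ri), ncol = 2)
m[idx] <- vec                       # cells never addressed by idx stay NA
m
```

Cells that no pair in idx points to, here rows 3 and 5 of the second column, remain NA until they are replaced.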
Using data.table. Create a grouping variable as above, and reshape from long to wide.
library(data.table)
d <- data.table(vec)
d[ , g := cumsum(!grepl("^#", vec))]
dcast(d, g ~ rowid(g), value.var = "vec", fill = "#")
# g 1 2
# 1: 1 ab #4
# 2: 2 gw #29
# 3: 3 mp #
# 4: 4 jq #35
# 5: 5 ez #