
Spacing vector by regular pattern

I have a vector

vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")

which generally alternates between two kinds of strings: the first purely alphabetical, the second a number prefixed with the symbol #.

However, there are cases where no # term appears: in the example above, between mp and jq, and again after ez. I would like to define a function which "fills the gaps" with the character string #, so that I would have the output:

 [1] "ab" "#4" "gw" "#29" "mp" "#" "jq" "#35" "ez" "#"

which I would then convert to a data frame

   V1  V2
1  ab  #4
2  gw  #29
3  mp  #
4  jq  #35
5  ez  #

My attempt so far is rather clunky and relies on looping through the vector and filling the gaps. I'd be interested to see more elegant solutions.


My Solution

greplSpace <- function(pattern, replacement, x){

  j <- 1

  while( j < length(x) ){
    if(grepl(pattern, x[j+1]) ){
      j <- j+2 
    } else {
      x <- c( x[1:j], replacement, x[(j+1):length(x)] )
      j <- j+2
    }
  }

  if( ! grepl(pattern, tail(x,1) ) ){ x <- c(x, replacement) }

  return(x)
}

library(magrittr)

vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")

vec %>% greplSpace("#", "#", . ) %>% 
        matrix(ncol = 2, byrow = TRUE) %>%
        as.data.frame

owen88 asked May 01 '18 06:05


4 Answers

Starting with your vec, we can create the expected data frame directly with a few functions from dplyr, tidyr, and stringr.

library(dplyr)
library(tidyr)
library(stringr)

vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")

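# tibble() replaces the deprecated data_frame()
dat <- tibble(Value = vec)

dat2 <- dat %>%
  # test the Value column itself rather than the global 'vec'
  mutate(String = !str_detect(Value, "#"),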
         Key = ifelse(String, "V1", "V2"),
         Row = cumsum(String)) %>%
  select(-String) %>%
  spread(Key, Value, fill = "#") %>%
  select(-Row)

dat2
# # A tibble: 5 x 2
#   V1    V2   
#   <chr> <chr>
# 1 ab    #4   
# 2 gw    #29  
# 3 mp    #    
# 4 jq    #35  
# 5 ez    #   

www answered Nov 05 '22 20:11


Here is a base R option with split. Create a logical index by checking for "#" in each of the strings, take the cumulative sum of its negation, and split the original vector by this grouping variable into a list ('lst'). List elements that have fewer than two elements (the maximum length) are padded with NA at the end by assignment with length<-. Then rbind the list elements into a two-column matrix. If needed, convert those NA to #.

lst <-  split(vec, cumsum(!grepl("#", vec)))
out <- do.call(rbind, lapply(lst, `length<-`, max(lengths(lst))))
out[,2][is.na(out[,2])] <- "#" #not recommended though 
out
#  [,1] [,2] 
#1 "ab" "#4" 
#2 "gw" "#29"
#3 "mp" "#"  
#4 "jq" "#35"
#5 "ez" "#"  

Wrap it with as.data.frame if we need a data.frame output.
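
As a minimal sketch of that last step (the name df is only illustrative, and stringsAsFactors = FALSE is an assumption made here so the columns stay character on older R versions), the conversion could look like this. The NA values are filled after the conversion to match the output requested in the question:

```r
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")

# Group each alphabetic string with the "#" term (if any) that follows it
lst <- split(vec, cumsum(!grepl("#", vec)))

# Pad every group to the longest group length; short groups gain a trailing NA
out <- do.call(rbind, lapply(lst, `length<-`, max(lengths(lst))))

# A matrix without column names becomes a data frame with columns V1 and V2
df <- as.data.frame(out, stringsAsFactors = FALSE)

# Optional: replace the NA padding with "#", as asked for in the question
df$V2[is.na(df$V2)] <- "#"
df
```

Leaving the NA in place is usually easier to work with downstream; the replacement is only there to reproduce the requested layout.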

akrun answered Nov 05 '22 20:11


You can use base R:

First collapse the vector into a string, inserting # where needed. Then just read it in using read.csv.

vec1=gsub("([a-z]),\\s*([a-z])|$","\\1,#,\\2",toString(vec))
read.csv(text=gsub("(#.*?),","\\1\n",vec1),h=F)
  V1  V2
1 ab  #4
2 gw #29
3 mp   #
4 jq #35
5 ez   #

Explanation:

  • First collapse the vector into a string with toString.
  • Then, wherever there are letters on either side of a comma (i.e. [a-z],\s*[a-z]) or the end of the string is reached (i.e. |$), insert a #.
  • Then create line breaks after each number or # and read in the data as a table.
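
To see what each of these steps does, the one-liner can be unrolled as in the sketch below. The intermediate names s1, s2, s3 and res are made up for illustration, and strip.white = TRUE and stringsAsFactors = FALSE are additions assumed here to drop the spaces left by toString and to keep the columns as character:

```r
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")

# Step 1: collapse the vector into one comma-separated string
s1 <- toString(vec)

# Step 2: insert "#" between two adjacent alphabetic terms and at the end
s2 <- gsub("([a-z]),\\s*([a-z])|$", "\\1,#,\\2", s1)

# Step 3: turn the comma that follows every "#" term into a line break
s3 <- gsub("(#.*?),", "\\1\n", s2)

# Print the three intermediate strings to follow the transformation
cat(s1, s2, s3, sep = "\n")

# Step 4: parse the text as a two-column table
res <- read.csv(text = s3, header = FALSE,
                strip.white = TRUE, stringsAsFactors = FALSE)
res
```

Note that the pattern only looks at lowercase letters, so this approach assumes the alphabetic terms match [a-z].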

You can also do:

a=read.csv(h=F,text=toString(sub("([a-z]+)","\n\\1",vec)),na=c(" ",""))[1:2]
a
  V1   V2
1 ab   #4
2 gw  #29
3 mp <NA>
4 jq  #35
5 ez <NA>

data.frame(replace(as.matrix(a),is.na(a),"#"))
  V1   V2
1 ab   #4
2 gw  #29
3 mp    #
4 jq  #35
5 ez    #

KU99 answered Nov 05 '22 19:11


Another base possibility:

do.call(rbind, tapply(vec, cumsum(!grepl("^#", vec)), FUN = function(x){
  if(length(x) == 1) c(x, "#") else x}))

#   [,1] [,2] 
# 1 "ab" "#4" 
# 2 "gw" "#29"
# 3 "mp" "#"  
# 4 "jq" "#35"
# 5 "ez" "#"

Explanation:

  1. Check whether each element of vec starts with #, and negate the result: !grepl("^#", vec) creates a logical vector.

  2. Create a grouping variable by applying cumsum to the logical vector (note: 1 & 2 similar to @akrun).

  3. Use tapply to apply a function to each subset of vec, defined by the grouping variable. Check if the length is 1. If so, pad by a trailing #, else just return the subset: if(length(x) == 1) c(x, "#") else x

  4. Bind the resulting list together by row: do.call(rbind, ...)
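
The four steps can also be written out one at a time, which makes the intermediate values visible. This is only a sketch of the same idea; the names is_label, grp, pad, rows and res are illustrative and not part of the answer above:

```r
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")

# Step 1: TRUE wherever an element does NOT start with "#"
is_label <- !grepl("^#", vec)
# TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE

# Step 2: a running count of the labels gives one group id per output row
grp <- cumsum(is_label)
# 1 1 2 2 3 4 4 5

# Step 3: pad groups that consist of a label only with a trailing "#"
pad <- function(x) if (length(x) == 1) c(x, "#") else x
rows <- tapply(vec, grp, FUN = pad)

# Step 4: bind the padded groups together by row into a 5 x 2 matrix
res <- do.call(rbind, rows)
res
```

This relies on vec starting with an alphabetic term and on at most one # term following each of them, which holds for the example in the question.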


Another one:

# create a row index 
ri <- cumsum(!grepl("^#", vec))

# create a column index
ci <- ave(ri, ri, FUN  = seq_along)

# create an empty matrix of desired dimensions
m <- matrix(nrow = max(ri), ncol = 2)

# assign 'vec' to matrix at relevant indices
m[cbind(ri, ci)] <- vec

# replace NA with '#'
m[is.na(m)] <- "#"
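
For reference, here is the same matrix-indexing idea as a self-contained sketch with the index values spelled out in comments. The final conversion to a data frame (and the name df) is an addition assumed here to match the layout requested in the question:

```r
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")

ri <- cumsum(!grepl("^#", vec))     # row index:    1 1 2 2 3 4 4 5
ci <- ave(ri, ri, FUN = seq_along)  # column index: 1 2 1 2 1 1 2 1

# Empty (NA) matrix with one row per alphabetic term
m <- matrix(nrow = max(ri), ncol = 2)

# A two-column index matrix assigns each element to its (row, column) cell
m[cbind(ri, ci)] <- vec

# Cells that were never assigned are still NA; fill them with "#"
m[is.na(m)] <- "#"

df <- as.data.frame(m, stringsAsFactors = FALSE)
df
```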

Using data.table: create a grouping variable as above, then reshape from long to wide.

library(data.table)
d <- data.table(vec) 
d[ , g := cumsum(!grepl("^#", vec))]
dcast(d, g ~ rowid(g), value.var = "vec", fill = "#")
#    g  1   2
# 1: 1 ab  #4
# 2: 2 gw #29
# 3: 3 mp   #
# 4: 4 jq #35
# 5: 5 ez   # 

Henrik answered Nov 05 '22 21:11