
R - Split by "\n" or three spaces and retain at least one space when there are three spaces

Tags: regex, split, r

I hope I can explain this clearly. In my data, missing information in a string is marked by three spaces and, surprisingly, a missing field is not followed by a \n before the next piece of information.

Imagine I have a string like:

string <- "abc
   def
ghi   jkl"

I want the output of a regular expression (maybe strsplit() combined with a more advanced function) to be:

[[1]]
[1] "abc" "" "def" "ghi" "" "jkl"

That is, it should split when it finds a \n, and when it finds three spaces it should split and insert a blank element. I need the missing information to be marked as a separate value. Otherwise my script breaks, because it treats the next piece of information as, for example, three spaces concatenated with the string def.

Thank you

asked Dec 23 '15 13:12 by Geiser


1 Answer

Here are two solutions which both use strsplit but differ in how they split:

1) Split on newline. Remove all newlines, giving s1, and then add a newline after every third character, giving s2. Split s2 on newlines and replace each element that consists of three consecutive spaces with the empty string.

Split <- function(string) {
  s1 <- gsub("\n", "", string)
  s2 <- gsub("(.{3})", "\\1\n", s1)
  spl <- strsplit(s2, "\n")
  lapply(spl, function(s) replace(s, s == "   ", ""))
}

# test
string <- "abc\n   def\nghi   jkl"
Split(string)
## [[1]]
## [1] "abc" ""    "def" "ghi" ""    "jkl"
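
To see what each step of solution 1 produces, the intermediate values can be inspected directly. This is only an illustrative sketch; like the solution itself, it assumes that every field is exactly three characters wide.

```r
string <- "abc\n   def\nghi   jkl"

# Step 1: drop the newlines so only fixed-width 3-character fields remain
s1 <- gsub("\n", "", string)
s1
## [1] "abc   defghi   jkl"

# Step 2: re-insert a newline after every group of three characters
s2 <- gsub("(.{3})", "\\1\n", s1)
s2
## [1] "abc\n   \ndef\nghi\n   \njkl\n"

# Step 3: split on the newlines; blank fields are still three spaces here,
# which is why the final replace() step is needed
strsplit(s2, "\n")[[1]]
## [1] "abc" "   " "def" "ghi" "   " "jkl"
```

The trailing newline in s2 does not create an extra empty element, because strsplit() ignores a match at the very end of the string.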

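2) Split on a zero-width 3-character regexp. Remove the newlines and split using the indicated regular expression, a lookbehind that matches the position following any three characters. Finally, replace each element that consists of three consecutive spaces with the empty string.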

Split2 <- function(string) {
  s1 <- gsub("\n", "", string)
  spl <- strsplit(s1, "(?<=...)", perl = TRUE)
  lapply(spl, function(s) replace(s, s == "   ", ""))
}

# test
string <- "abc\n   def\nghi   jkl"
Split2(string)
## [[1]]
## [1] "abc" ""    "def" "ghi" ""    "jkl"
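
The split step of solution 2 can be looked at in isolation. The sketch below is illustrative only and starts from the newline-free string; the variable name pieces is introduced here and is not part of the answer.

```r
s1 <- "abc   defghi   jkl"

# "(?<=...)" is a zero-width lookbehind: it matches the empty position that
# has three characters immediately before it, so no characters are consumed
pieces <- strsplit(s1, "(?<=...)", perl = TRUE)[[1]]
pieces
## [1] "abc" "   " "def" "ghi" "   " "jkl"

# every piece is three characters wide
nchar(pieces)
## [1] 3 3 3 3 3 3
```

The chunks come out three characters wide, rather than one split at every later position, because strsplit() removes each match together with everything to its left and then searches the remainder of the string afresh.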

Note 1: The other answers to this question do not work for the following input string, which has two empty fields in succession. The solutions here correctly recognize the two consecutive empty 3-character fields after the abc field:

string2 <- "abc\n      def\nghi   jkl" # 6 spaces before d, 3 spaces before j

Split(string2)
## [[1]]
## [1] "abc" ""    ""    "def" "ghi" ""    "jkl"

Split2(string2)
## [[1]]
## [1] "abc" ""    ""    "def" "ghi" ""    "jkl"
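
One way to confirm that the two functions agree on this input is to compare their results with identical(). The sketch below repeats both definitions from the answer so that it runs on its own; the variable name expected is introduced here for illustration.

```r
Split <- function(string) {
  s1 <- gsub("\n", "", string)
  s2 <- gsub("(.{3})", "\\1\n", s1)
  spl <- strsplit(s2, "\n")
  lapply(spl, function(s) replace(s, s == "   ", ""))
}

Split2 <- function(string) {
  s1 <- gsub("\n", "", string)
  spl <- strsplit(s1, "(?<=...)", perl = TRUE)
  lapply(spl, function(s) replace(s, s == "   ", ""))
}

string2 <- "abc\n      def\nghi   jkl" # 6 spaces before d, 3 spaces before j

# both approaches should return exactly the same list
identical(Split(string2), Split2(string2))
## [1] TRUE

# and that list should contain two empty fields after "abc"
expected <- list(c("abc", "", "", "def", "ghi", "", "jkl"))
identical(Split(string2), expected)
## [1] TRUE
```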

Note 2: The two solutions above can also be nicely expressed using a magrittr pipeline:

library(magrittr)
string %>% 
  gsub(pattern = "\n", replacement = "") %>%
  gsub(pattern = "(.{3})", replacement = "\\1\n") %>%
  strsplit("\n") %>%
  lapply(function(s) replace(s, s == "   ", ""))

## [[1]]
## [1] "abc" ""    "def" "ghi" ""    "jkl"


library(magrittr)
string %>% 
  gsub(pattern = "\n", replacement = "") %>%
  strsplit("(?<=...)", perl = TRUE) %>%
  lapply(function(s) replace(s, s == "   ", ""))

## [[1]]
## [1] "abc" ""    "def" "ghi" ""    "jkl"
answered Oct 01 '22 03:10 by G. Grothendieck