
Combining lines in character vector in R

Tags: regex, text, r

I have a character vector (content) of about 50,000 lines in R. However, some lines that belong together ended up as separate elements when the text file was read in. Specifically, the lines look something like this:

[1] hello,
[2] world
[3] ""
[4] how
[5] are 
[6] you
[7] ""

I would like to combine the lines so that I have something that looks like this:

[1] hello, world
[2] how are you

I have tried to write a for loop:

for(i in 1:length(content)){
    if(content[i+1] != ""){
        content[i+1] <- c(content[i], content[i+1])
    }
}  

But when I run the loop, I get an error: missing value where TRUE/FALSE needed.

Can anyone suggest a better way to do this, maybe not even using a loop?

Thanks!

EDIT: I am actually trying to apply this to a Corpus of documents that are each many thousands of lines long. Any ideas on how to turn these solutions into a function that can be applied to the content of each document?

dc3 asked Dec 07 '25 02:12


2 Answers

You don't need a loop to do that:

x <- c("hello,", "world", "", "how", "\nare", "you", "")

# build a random string, wrapped in newlines, to use as a split marker
# that is unlikely to occur in the text itself
dummy <- paste(
  c("\n", sample(letters, 20, replace = TRUE), "\n"),
  collapse = ""
)
x[x == ""] <- dummy             # replace empty strings with the split marker
y <- paste(x, collapse = " ")   # make one long string
z <- unlist(strsplit(y, dummy, fixed = TRUE))  # cut the string at the split marker
gsub("^ | $", "", z)            # remove the space at the start and end
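
As for the error in the question's loop, a likely cause is that on the last iteration i + 1 runs past the end of the vector, so content[i + 1] is NA and the comparison inside if() yields NA rather than TRUE or FALSE. A small sketch of that, where the three-element content vector is made up for illustration:

```r
# Hypothetical toy vector standing in for the real `content`
content <- c("hello,", "world", "")

i <- length(content)          # last iteration of for (i in 1:length(content))
past_end <- content[i + 1]    # indexing past the end of a vector returns NA
comparison <- past_end != ""  # comparing NA with a string gives NA

is.na(past_end)    # TRUE
is.na(comparison)  # TRUE; if (NA) raises the error from the question
```

Looping over seq_len(length(content) - 1) would avoid reading past the end, although the vectorised approach still avoids the loop altogether.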
Thierry answered Dec 08 '25 16:12


I think there are more elegant solutions, but this might be usable for you:

chars <- c("hello,","world","","how","are","you","")
# identify groups that belong together (id increases each time a "" is found)
ids <- cumsum(chars == "")

# split the vector (and filter out "" by using the select vector)
select <- chars != ""
splitted <- split(chars[select], ids[select])

# paste the groups together
res <- sapply(splitted, paste, collapse = " ")

# remove names (if necessary, probably not)
res <- unname(res) # thanks @Roland

> res
[1] "hello, world" "how are you"
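
To address the edit in the question, the steps above can be wrapped in a function and applied to each document. This is only a sketch: combine_lines and docs are hypothetical names, and it assumes the content of each document is available as a plain character vector without NA entries (how to extract that from a Corpus object depends on the package in use).

```r
# Hypothetical helper: join consecutive non-empty lines, treating "" as a separator
combine_lines <- function(lines) {
  keep <- lines != ""                      # drop the separator lines themselves
  ids <- cumsum(lines == "")               # group id increases at every ""
  groups <- split(lines[keep], ids[keep])  # one list element per group
  unname(vapply(groups, paste, character(1), collapse = " "))
}

# Hypothetical stand-in for the documents: a list of character vectors
docs <- list(
  c("hello,", "world", "", "how", "are", "you", ""),
  c("one", "two", "", "three")
)

# Apply the helper to every document
combined <- lapply(docs, combine_lines)
```

With the toy docs above, combined[[1]] is c("hello, world", "how are you"). If lines carry stray whitespace, as "are " does in the question, running trimws() on them first may help.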
Heroka answered Dec 08 '25 16:12


