Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to remove common parts of strings in a character vector in R?

Tags:

string

regex

r

Assume a character vector like the following

file1_p1_analysed_samples.txt
file1_p1_raw_samples.txt
f2_file2_p1_analysed_samples.txt
f3_file3_p1_raw_samples.txt

Desired output:

file1_p1_analysed
file1_p1_raw
file2_p1_analysed
file3_p1_raw

I would like to compare the elements and remove parts of the string from start and end as much as possible but keep them unique.

The above one is just an example. The parts to be removed are not common to all elements. I need a general solution independent of the strings in the above example.

So far I have been able to chuck off parts that are common to all elements, provided the separator and the resulting split parts are of same length. Here is the function,

mf <- function(x,sep){
    xsplit = strsplit(x,split = sep)
    xdfm <- as.data.frame(do.call(rbind,xsplit))
    res <- list()
    for (i in 1:ncol(xdfm)){
        if (!all(xdfm[,i] == xdfm[1,i])){
            res[[length(res)+1]] <- as.character(xdfm[,i])
        }
    }
    res <- as.data.frame(do.call(rbind,res))
    res <- apply(res,2,function(x) paste(x,collapse="_"))
    return(res)
}

Applying the above function:

 a = c("a_samples.txt","b_samples.txt")
 mf(a,"_")
  V1  V2
 "a" "b"

2.

> b = c("apple.fruit.txt","orange.fruit.txt")
> mf(b,sep = "\\.")
      V1       V2
 "apple" "orange"

If the resulting split parts are not same length, this doesn't work.

like image 870
Veera Avatar asked Sep 16 '25 12:09

Veera


2 Answers

What about

files <- c("file1_p1_analysed_samples.txt", "file1_p1_raw_samples.txt", "f2_file2_p1_analysed_samples.txt", "f3_file3_p1_raw_samples.txt")
new_files <- gsub('_samples\\.txt', '', files)
new_files

... which yields

[1] "file1_p1_analysed"    "file1_p1_raw"         "f2_file2_p1_analysed" "f3_file3_p1_raw"     

This removes the _samples.txt part from your strings.

like image 113
Jan Avatar answered Sep 18 '25 05:09

Jan


Why not:

strings <- c("file1_p1_analysed_samples.txt",
"file1_p1_raw_samples.txt",
"f2_file2_p1_analysed_samples.txt",
"f3_file3_p1_raw_samples.txt")

sapply(strings, function(x) {
  pattern <- ".*(file[0-9].*)_samples\\.txt"
  gsub(x, pattern = pattern, replacement = "\\1")
})

Things that match between ( and ) can be called back as a group in the replacement with backwards referencing. You can do this with \\1. You can even specify multiple groups!

Seeing your comment on Jan's answer. Why not define your static bits and paste together a pattern and always surround them with parentheses? Then you can always call \\i in the replacement of gsub.

like image 31
Erik Schutte Avatar answered Sep 18 '25 04:09

Erik Schutte