
R regular expression to split string column into multiple columns

I have a column as follows in a dataframe called PeakBoundaries:

           chrom
 chr11:69464719-69502928
 chr7:55075808-55093954
 chr8:128739772-128762863
 chr3:169389459-169490555
 chr17:37848534-37877201
 chr19:30306758-30316875
 chr1:150496857-150678056
 chr12:69183279-69260755
 chr11:77610143-77641464
 chr8:38191804-38260814
 chr12:58135797-58156509

I would like to separate out the columns so that the columns look like below in a dataframe:

chr       chrStart           chrEnd
chr11     69464719         69502928
chr7      55075808         55093954
chr8      128739772        128762863
chr3      169389459        169490555

etc.

I have tried a regular expression approach, but I cannot get the matches into a new column:

 PeakBoundaries$chrOnly <- PeakBoundaries[grep("\\w+?=\\:"),PeakBoundaries$chrom]

I am met with the error: Error in [.data.frame(PeakBoundaries, grep("\w+?=\:"), PeakBoundaries$chrom) : undefined columns selected

Sebastian Zeki asked Oct 17 '25 16:10


2 Answers

Try this - no regex needed, just the strsplit function:

dat <- read.table(text="chr11:69464719-69502928
chr7:55075808-55093954
chr8:128739772-128762863
chr3:169389459-169490555
chr17:37848534-37877201
chr19:30306758-30316875
chr1:150496857-150678056
chr12:69183279-69260755
chr11:77610143-77641464
chr8:38191804-38260814
chr12:58135797-58156509", stringsAsFactors=FALSE)

dat[,2:4] <- matrix(unlist(strsplit(dat[,1],split = "\\:|\\-")), ncol=3, byrow=TRUE)

colnames(dat) <- c("chrom", "chr", "chrStart", "chrEnd")

# Convert last two columns from character to numeric:

dat$chrStart <- as.numeric(dat$chrStart)
dat$chrEnd <- as.numeric(dat$chrEnd)

Results

> dat

                      chrom   chr  chrStart    chrEnd
1   chr11:69464719-69502928 chr11  69464719  69502928
2    chr7:55075808-55093954  chr7  55075808  55093954
3  chr8:128739772-128762863  chr8 128739772 128762863
4  chr3:169389459-169490555  chr3 169389459 169490555
5   chr17:37848534-37877201 chr17  37848534  37877201
6   chr19:30306758-30316875 chr19  30306758  30316875
7  chr1:150496857-150678056  chr1 150496857 150678056
8   chr12:69183279-69260755 chr12  69183279  69260755
9   chr11:77610143-77641464 chr11  77610143  77641464
10   chr8:38191804-38260814  chr8  38191804  38260814
11  chr12:58135797-58156509 chr12  58135797  58156509

Edit

You could do everything using only your existing dataframe. Replace dat[,1] with PeakBoundaries$chrom, and replace dat[,2:4] with PeakBoundaries[,(ncol(PeakBoundaries)+1):(ncol(PeakBoundaries)+3)] so that the three new columns are appended after the existing ones.
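
As a minimal sketch of that substitution, the block below applies the same strsplit call to a small stand-in for PeakBoundaries. The two-row data frame and the helper names parts and n are assumptions made for illustration; only the chrom column name and the split logic come from the question and the answer above.

```r
# Hypothetical two-row stand-in for the asker's PeakBoundaries data frame
PeakBoundaries <- data.frame(
  chrom = c("chr11:69464719-69502928", "chr7:55075808-55093954"),
  stringsAsFactors = FALSE
)

# Split every string on ":" or "-" and stack the pieces row by row
parts <- matrix(unlist(strsplit(PeakBoundaries$chrom, split = "[:-]")),
                ncol = 3, byrow = TRUE)

# Append the three pieces after the existing columns, then name them
n <- ncol(PeakBoundaries)
PeakBoundaries[, (n + 1):(n + 3)] <- parts
colnames(PeakBoundaries)[(n + 1):(n + 3)] <- c("chr", "chrStart", "chrEnd")

# The pieces arrive as character strings, so convert the coordinates
PeakBoundaries$chrStart <- as.numeric(PeakBoundaries$chrStart)
PeakBoundaries$chrEnd <- as.numeric(PeakBoundaries$chrEnd)
```

Storing ncol(PeakBoundaries) in n before the assignment keeps the column indices stable, since the column count changes once the new columns are added.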

Edit By OP

OK, so I think there's something a bit odd with my dataset, but I've sorted it with Dominic's help so that it is now:

  PeakBoundaries <- as.data.frame(PeakBoundaries)
  PeakBoundaries <- PeakBoundaries[-1,]
  PeakBoundaries <- as.data.frame(PeakBoundaries)
  PeakBoundaries$PeakBoundaries <- 
             as.character(PeakBoundaries$PeakBoundaries)
  PeakBoundaries[,(ncol(PeakBoundaries)+1):(ncol(PeakBoundaries)+3)] <- 
             matrix(unlist(strsplit(PeakBoundaries$PeakBoundaries,
                                    split = "\\:|\\-")), ncol=3, byrow=TRUE)
Dominic Comtois answered Oct 20 '25 05:10


A shorter version of Dominic's answer, making the insertion a one-liner:

dat <- data.frame(chrom = readLines(textConnection("chr11:69464719-69502928
chr7:55075808-55093954
chr8:128739772-128762863
chr3:169389459-169490555
chr17:37848534-37877201
chr19:30306758-30316875
chr1:150496857-150678056
chr12:69183279-69260755
chr11:77610143-77641464
chr8:38191804-38260814
chr12:58135797-58156509")) )

library(stringr)  # provides str_split()

dat[, c('chr','chrStart','chrEnd')] <- t( sapply( dat$chrom, function(s) { str_split(s, '[:-]') [[1]] } ) )

dat$chrStart <- as.numeric(dat$chrStart)
dat$chrEnd <- as.numeric(dat$chrEnd)
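
If a capture-group regular expression is preferred, closer to what the question first attempted, the same split can be sketched in base R with sub and backreferences. This is only an illustration: the chrom vector, the pattern, and the res data frame are assumptions, and the pattern presumes every value has exactly the form name:start-end.

```r
# Hypothetical sample of the chrom column
chrom <- c("chr11:69464719-69502928", "chr7:55075808-55093954")

# Three capture groups: name before ":", start before "-", end after "-"
pattern <- "^([^:]+):([0-9]+)-([0-9]+)$"

# sub() swaps the whole match for one captured group at a time
res <- data.frame(
  chrom    = chrom,
  chr      = sub(pattern, "\\1", chrom),
  chrStart = as.numeric(sub(pattern, "\\2", chrom)),
  chrEnd   = as.numeric(sub(pattern, "\\3", chrom)),
  stringsAsFactors = FALSE
)
```

A value that does not match the pattern is returned unchanged by sub, so as.numeric would yield NA with a warning for it, which makes malformed rows easy to spot.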
smci answered Oct 20 '25 05:10


