I have a column as follows in a dataframe called PeakBoundaries:
chrom
chr11:69464719-69502928
chr7:55075808-55093954
chr8:128739772-128762863
chr3:169389459-169490555
chr17:37848534-37877201
chr19:30306758-30316875
chr1:150496857-150678056
chr12:69183279-69260755
chr11:77610143-77641464
chr8:38191804-38260814
chr12:58135797-58156509
I would like to split this column into three, so that the dataframe looks like this:
chr chrStart chrEnd
chr11 69464719 69502928
chr7 55075808 55093954
chr8 128739772 128762863
chr3 169389459 169490555
etc.
I have tried a regular expression approach but am not getting anywhere in terms of getting the match to enter into a new column:
PeakBoundaries$chrOnly <- PeakBoundaries[grep("\\w+?=\\:"),PeakBoundaries$chrom]
I am met with the error:
Error in `[.data.frame`(PeakBoundaries, grep("\\w+?=\\:"), PeakBoundaries$chrom) :
  undefined columns selected
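Try this. No grep-style matching is needed, just the strsplit function (its split argument is itself a regular expression, here matching either ":" or "-"):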
dat <- read.table(text="chr11:69464719-69502928
chr7:55075808-55093954
chr8:128739772-128762863
chr3:169389459-169490555
chr17:37848534-37877201
chr19:30306758-30316875
chr1:150496857-150678056
chr12:69183279-69260755
chr11:77610143-77641464
chr8:38191804-38260814
chr12:58135797-58156509", stringsAsFactors=FALSE)
dat[,2:4] <- matrix(unlist(strsplit(dat[,1],split = "\\:|\\-")), ncol=3, byrow=TRUE)
colnames(dat) <- c("chrom", "chr", "chrStart", "chrEnd")
# Convert last two columns from character to numeric:
dat$chrStart <- as.numeric(dat$chrStart)
dat$chrEnd <- as.numeric(dat$chrEnd)
Results
> dat
                      chrom   chr  chrStart    chrEnd
1   chr11:69464719-69502928 chr11  69464719  69502928
2    chr7:55075808-55093954  chr7  55075808  55093954
3  chr8:128739772-128762863  chr8 128739772 128762863
4  chr3:169389459-169490555  chr3 169389459 169490555
5   chr17:37848534-37877201 chr17  37848534  37877201
6   chr19:30306758-30316875 chr19  30306758  30316875
7  chr1:150496857-150678056  chr1 150496857 150678056
8   chr12:69183279-69260755 chr12  69183279  69260755
9   chr11:77610143-77641464 chr11  77610143  77641464
10   chr8:38191804-38260814  chr8  38191804  38260814
11  chr12:58135797-58156509 chr12  58135797  58156509
Edit
You could do everything using only your existing dataframe. Replace dat[,1] with PeakBoundaries$chrom and dat[,2:4] with PeakBoundaries[,(ncol(PeakBoundaries)+1):(ncol(PeakBoundaries)+3)] and you should have it!
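A minimal sketch of that substitution, assuming PeakBoundaries has a chrom column as in the question. The two-row data frame here is a hypothetical stand-in for the real data, and the helper names n and pieces are only for illustration. The as.character() call is a precaution in case the column was read in as a factor, which strsplit does not accept.

```r
# Hypothetical stand-in for the asker's data frame (values taken from the question).
PeakBoundaries <- data.frame(
  chrom = c("chr11:69464719-69502928", "chr7:55075808-55093954"),
  stringsAsFactors = FALSE
)

# Remember how many columns exist before appending the three new ones.
n <- ncol(PeakBoundaries)

# Split every entry on ":" or "-"; as.character() guards against a factor column.
pieces <- strsplit(as.character(PeakBoundaries$chrom), split = ":|-")

# Fill the three new columns row by row, then give them readable names.
PeakBoundaries[, (n + 1):(n + 3)] <- matrix(unlist(pieces), ncol = 3, byrow = TRUE)
colnames(PeakBoundaries)[(n + 1):(n + 3)] <- c("chr", "chrStart", "chrEnd")

# The split pieces are character strings; convert the coordinates to numeric.
PeakBoundaries$chrStart <- as.numeric(PeakBoundaries$chrStart)
PeakBoundaries$chrEnd   <- as.numeric(PeakBoundaries$chrEnd)
```

Computing n once before the assignment keeps the column indices stable, since ncol(PeakBoundaries) changes as soon as the new columns are added.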
Edit By OP
OK, so I think there's something a bit odd with my dataset, but I've sorted it out with Dominic's help, so that it is now:
PeakBoundaries <- as.data.frame(PeakBoundaries)
PeakBoundaries <- PeakBoundaries[-1,]
PeakBoundaries <- as.data.frame(PeakBoundaries)
PeakBoundaries$PeakBoundaries <-
as.character(PeakBoundaries$PeakBoundaries)
PeakBoundaries[,(ncol(PeakBoundaries)+1):(ncol(PeakBoundaries)+3)] <-
matrix(unlist(strsplit(PeakBoundaries$PeakBoundaries,
split = "\\:|\\-")), ncol=3, byrow=TRUE)
A shorter version of Dominic's answer, making the insertion a one-liner:
dat <- data.frame(chrom = readLines(textConnection("chr11:69464719-69502928
chr7:55075808-55093954
chr8:128739772-128762863
chr3:169389459-169490555
chr17:37848534-37877201
chr19:30306758-30316875
chr1:150496857-150678056
chr12:69183279-69260755
chr11:77610143-77641464
chr8:38191804-38260814
chr12:58135797-58156509")) )
library(stringr)  # str_split() comes from the stringr package
dat[, c('chr','chrStart','chrEnd')] <- t( sapply( dat$chrom, function(s) { str_split(s, '[:-]') [[1]] } ) )
dat$chrStart <- as.numeric(dat$chrStart)
dat$chrEnd <- as.numeric(dat$chrEnd)
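Since the question started from a regular expression, here is a sketch of how a capture-group approach could look in base R, using regexec and regmatches instead of splitting. This is not taken from either answer above; the variable names pattern, m, parts and res are illustrative, and the pattern assumes every entry has the exact form name:start-end.

```r
# Two sample entries from the question.
chrom <- c("chr11:69464719-69502928", "chr7:55075808-55093954")

# Three capture groups: chromosome name, start coordinate, end coordinate.
pattern <- "^([^:]+):([0-9]+)-([0-9]+)$"

# regexec() finds the full match and each group; regmatches() extracts them
# as one character vector per input string (full match first, then the groups).
m <- regmatches(chrom, regexec(pattern, chrom))

# Stack the vectors into a matrix and drop the first column (the full match).
parts <- do.call(rbind, m)[, -1, drop = FALSE]

res <- data.frame(
  chr      = parts[, 1],
  chrStart = as.numeric(parts[, 2]),
  chrEnd   = as.numeric(parts[, 3]),
  stringsAsFactors = FALSE
)
```

Entries that do not match the pattern yield an empty vector and would silently drop out of the rbind step, so it is worth checking that nrow(res) equals length(chrom) on real data.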