Is there a way to split a string like this?
A1BG\tAAAGGGCGTTCACCGG\t2 A1BG\tAAGATAGCATCCCACT\t1
I would like to split by "\" so that I can count how many genes are in the file (a gene in this case is A1BG) and how many codes there are (codes are, for example, AAAGGGCGTTCACCGG and AAGATAGCATCCCACT). My attempt below hasn't been successful.
strsplit(mydf, '\')[[1]]
Can anyone help me please?
It looks like you have a malformed TSV (tab-separated values) table. If you swap the spaces for newlines, you can read it in as a table and don't need to set up your own parsing rules:
x <- "A1BG\tAAAGGGCGTTCACCGG\t2 A1BG\tAAGATAGCATCCCACT\t1"
x2 <- gsub(" ", "\n", x)
library(data.table)
DT = setnames(fread(x2), c("gene", "code", "num"))[]
#    gene             code num
# 1: A1BG AAAGGGCGTTCACCGG   2
# 2: A1BG AAGATAGCATCCCACT   1
Then you can count how many codes there are per gene like
DT[, .N, by=gene]
# or
DT[, .(N = uniqueN(code)), by=gene]
#    gene N
# 1: A1BG 2
or similarly use dplyr's count and n_distinct functions.
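For the dplyr route, a minimal sketch might look like the following. It assumes dplyr is installed and parses the string with base R's read.delim instead of fread; the names df, rows_per_gene, codes_per_gene and n_genes are only illustrative.

```r
library(dplyr)

x <- "A1BG\tAAAGGGCGTTCACCGG\t2 A1BG\tAAGATAGCATCCCACT\t1"

# Turn the space-separated records into lines, then parse them as tab-separated.
df <- read.delim(text = gsub(" ", "\n", x, fixed = TRUE), header = FALSE,
                 col.names = c("gene", "code", "num"),
                 stringsAsFactors = FALSE)

# Number of rows per gene (the result column is called n).
rows_per_gene <- df %>% count(gene)

# Number of distinct codes per gene.
codes_per_gene <- df %>%
  group_by(gene) %>%
  summarise(N = n_distinct(code), .groups = "drop")

# Number of distinct genes in the whole input.
n_genes <- n_distinct(df$gene)
```

count(gene) corresponds to DT[, .N, by=gene], and the group_by/summarise step corresponds to the uniqueN version.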
We can try matching on the regex pattern \b[ACGT]{16}\b, and then counting the number of matches in the input string:
x <- "A1BG\tAAAGGGCGTTCACCGG\t2 A1BG\tAAGATAGCATCCCACT\t1"
matches <- regmatches(x, gregexpr("\\b[ACGT]{16}\\b", x, perl=TRUE))[[1]]
length(matches)
[1] 2
If a code might not be exactly 16 bases long, replace {16} with a range that covers the expected lengths (e.g. {10,20} for codes between 10 and 20 bases).
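As a rough sketch of that variation, the same call with a length range could look like this. The 10 to 20 range is only an assumed example, and codes is an illustrative name. The gene name A1BG is not matched because it contains characters outside ACGT.

```r
x <- "A1BG\tAAAGGGCGTTCACCGG\t2 A1BG\tAAGATAGCATCCCACT\t1"

# Match whole "words" made only of A, C, G, T that are 10 to 20 characters long.
codes <- regmatches(x, gregexpr("\\b[ACGT]{10,20}\\b", x, perl = TRUE))[[1]]

length(codes)          # total number of codes found
length(unique(codes))  # number of distinct codes
```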