Strsplit and count occurrences

Question

Is there a way to split a string like this?

A1BG	AAAGGGCGTTCACCGG	2 A1BG	AAGATAGCATCCCACT	1

I would like to split by "\" in order to count how many genes are in the file (a gene in this case is A1BG) and how many codes there are (codes are, for example, AAAGGGCGTTCACCGG and AAGATAGCATCCCACT). My attempt below hasn't been successful.

strsplit(mydf, '\')[[1]]

Can anyone help me please?

Frank · Accepted Answer

It looks like you have a malformed TSV (tab-separated values) table. If you swap the spaces for newlines, you can read it in as a table and don't need to set up your own parsing rules:

x <- "A1BG\tAAAGGGCGTTCACCGG\t2 A1BG\tAAGATAGCATCCCACT\t1"
# replace each space (the record separator) with a newline
x2 <- gsub(" ", "\n", x)

library(data.table)
DT = setnames(fread(x2), c("gene", "code", "num"))[]

#    gene             code num
# 1: A1BG AAAGGGCGTTCACCGG   2
# 2: A1BG AAGATAGCATCCCACT   1
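If data.table is not available, the same idea can be sketched in base R. This is only an illustration under the assumption that fields are tab-separated and records are space-separated, as in the sample string; the names df and tokens are made up for the example.

```r
x  <- "A1BG\tAAAGGGCGTTCACCGG\t2 A1BG\tAAGATAGCATCCCACT\t1"
x2 <- gsub(" ", "\n", x, fixed = TRUE)

# read the repaired text as a tab-separated table without a header row
df <- read.table(text = x2, sep = "\t",
                 col.names = c("gene", "code", "num"),
                 stringsAsFactors = FALSE)

n_genes <- length(unique(df$gene))  # distinct genes
n_codes <- length(unique(df$code))  # distinct codes

# strsplit alternative: split on runs of spaces or tabs,
# then fold the flat token vector into rows of three fields
tokens <- strsplit(x, "[ \t]+")[[1]]
m <- matrix(tokens, ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("gene", "code", "num")))
```

Here n_genes is 1 and n_codes is 2 for the sample input. The strsplit variant keeps every field as character, so num would need as.integer() before any arithmetic.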

Then you can count how many codes there are per gene like

DT[, .N, by=gene]
# or 
DT[, .(N = uniqueN(code)), by=gene]

#    gene N
# 1: A1BG 2

or similarly use dplyr's count and n_distinct functions.
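A rough dplyr equivalent might look like the sketch below. It assumes the table has already been parsed into a data frame with columns gene, code and num; the data frame is rebuilt by hand here so the example is self-contained, and the names df, rows_per_gene and codes_per_gene are hypothetical.

```r
library(dplyr)

df <- data.frame(
  gene = c("A1BG", "A1BG"),
  code = c("AAAGGGCGTTCACCGG", "AAGATAGCATCCCACT"),
  num  = c(2L, 1L),
  stringsAsFactors = FALSE
)

# number of rows per gene (column n)
rows_per_gene <- count(df, gene)

# number of distinct codes per gene (column n_codes)
codes_per_gene <- df %>%
  group_by(gene) %>%
  summarise(n_codes = n_distinct(code))
```

count() corresponds to the .N version above, and n_distinct() to the uniqueN() version; the two differ only when a code is repeated within a gene.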

Tim Biegeleisen · Answer

We can try matching on the regex pattern \b[ACGT]{16}\b, and then counting the number of matches in the input string:

x <- "A1BG	AAAGGGCGTTCACCGG	2 A1BG	AAGATAGCATCCCACT	1"
# backslashes must be doubled inside an R string; "\b" alone is a backspace
matches <- regmatches(x, gregexpr("\\b[ACGT]{16}\\b", x, perl=TRUE))[[1]]
length(matches)

[1] 2

If the codes might not be exactly 16 bases long, widen the quantifier to a range that still gives the correct count, e.g. {10,20} for codes between 10 and 20 bases.
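As a sketch of that variation, the pattern below accepts runs of 10 to 20 bases. The bounds are only the example range mentioned above, not something known about the real data, and codes is a name chosen for illustration.

```r
x <- "A1BG\tAAAGGGCGTTCACCGG\t2 A1BG\tAAGATAGCATCCCACT\t1"

# match whole words made only of A, C, G, T, 10 to 20 characters long
codes <- regmatches(x, gregexpr("\\b[ACGT]{10,20}\\b", x, perl = TRUE))[[1]]

n_total    <- length(codes)          # all code occurrences
n_distinct <- length(unique(codes))  # distinct codes
```

For the sample string both counts are 2. The gene name A1BG is not matched because it contains characters outside [ACGT] and is shorter than the lower bound.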

Tags:

r

Elb
