I have the following simple script that tries to count the tag encoded with "CB:Z" in SAM/BAM file:
samtools view -h small.bam | grep "CB:Z:" |
sed 's/.*CB:Z:\([ACGT]*\).*/\1/' |
sort |
uniq -c |
awk '{print $2 " " $1}'
Typically it needs to process 40 million lines. That codes takes around 1 hour to finish.
This line sed 's/.*CB:Z:\([ACGT]*\).*/\1/'
is very time consuming.
How can I speed it up?
The reason I used the Regex is that the "CB" tag column-wise position is not fixed. Sometimes it's at column 20 and sometimes column 21.
Example BAM file can be found HERE.
Update
Speed comparison on complete 40 million lines file:
My initial code:
real 21m47.088s
user 26m51.148s
sys 1m27.912s
James Brown's with AWK:
real 1m28.898s
user 2m41.336s
sys 0m6.864s
James Brown's with MAWK:
real 1m10.642s
user 1m41.196s
sys 0m6.484s
Another awk, pretty much like @tripleee's, I'd assume:
$ samtools view -h small.bam | awk '
match($0,/CB:Z:[ACGT]*/) { # use match for the regex match
a[substr($0,RSTART+5,RLENGTH-5)]++ # len(CB:z:)==5, hence +-5
}
END {
for(i in a)
print i,a[i] # sample output,tweak it to your liking
}'
Sample output:
...
TCTTAATCGTCC 175
GGGAAGGCCTAA 190
TCGGCCGATCGG 32
GACTTCCAAGCC 76
CCGCGGCATCGG 36
TAGCGATCGTGG 125
...
Notice: Your sed 's/.*CB:Z:...
matches the last instance where as my awk 'match($0,/CB:Z:[ACGT]*/)...
matches the first.
Notice 2: Quoting @Sundeep in the comments: - - using LC_ALL=C mawk '..'
will give even better speed.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With