I am trying to remove the three-letter code at the end of a sequence (replace it with nothing) using sed, but it is not working for a pattern with multiple alternatives. Here are some example sequences:
GCAAAAAGTTGTATAGTCACACAACCTAGACTTATATCGTCTGCTATTCATTAG
GCAAAAAGTTGTATAGTCACACAACCTAGACTTATATCGTCTGCTATTCATTAA
GCAAAAAGTTGTATAGTCACACAACCTAGACTTATATCGTCTGCTATTCATTGA
When I use each regex individually with sed, it works:
echo "GCAAAAAGTTGTATAGTCACACAACCTAGACTTATATCGTCTGCTATTCATTAG" | sed 's/TAG$//'
echo "GCAAAAAGTTGTATAGTCACACAACCTAGACTTATATCGTCTGCTATTCATTAA" | sed 's/TAA$//'
echo "GCAAAAAGTTGTATAGTCACACAACCTAGACTTATATCGTCTGCTATTCATTGA" | sed 's/TGA$//'
However, when I try to combine the patterns in one regex, it doesn't work:
echo "GCAAAAAGTTGTATAGTCACACAACCTAGACTTATATCGTCTGCTATTCATTAG" |
sed 's/(TAG$|TAA$|TGA$)//'
Could somebody point out what I am doing wrong?
You need to use the extended regex switch in sed. With GNU sed:
sed -r 's/(TAG|TAA|TGA)$//'
Or on OS X (BSD sed):
sed -E 's/(TAG|TAA|TGA)$//'
Or without extended regex, using GNU sed's escaped alternation (this doesn't work on OS X though):
sed 's/\(TAG\|TAA\|TGA\)$//'
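If neither switch is available, one possible workaround is to stay within POSIX basic regex, which has no alternation at all. The following is only a sketch: the variable names and the sample input are made up for illustration. It uses a bracket expression for TAG/TAA and a bare t command so that at most one codon is removed per line.

```shell
# Hypothetical input: ends in TAA, with TGA immediately before it.
dna='ATGTGATAA'

# POSIX basic regex only: the bracket expression covers TAG and TAA,
# and a bare `t` skips the TGA rule if the first substitution fired.
portable=$(printf '%s\n' "$dna" | sed -e 's/TA[AG]$//' -e t -e 's/TGA$//')
printf '%s\n' "$portable"
```

Without the t, the second expression would run on the already trimmed line and could strip a second codon (here the TGA that precedes TAA).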
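In basic regular expressions, | and parentheses are literal characters. To use them for alternation and grouping in GNU sed, you need to escape them with a backslash: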
sed 's/\(TAG$\|TAA$\|TGA$\)//'
Alternatively, you can use the portable option -E to avoid the escaping. -E enables extended regular expressions, so your original command will run as written.
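As a quick check, the -E form can be run over all three sequences from the question at once. This is a minimal sketch; the variable names seqs and trimmed are only illustrative.

```shell
# The three example sequences from the question, one per line.
seqs='GCAAAAAGTTGTATAGTCACACAACCTAGACTTATATCGTCTGCTATTCATTAG
GCAAAAAGTTGTATAGTCACACAACCTAGACTTATATCGTCTGCTATTCATTAA
GCAAAAAGTTGTATAGTCACACAACCTAGACTTATATCGTCTGCTATTCATTGA'

# -E turns on extended regex, so ( | ) work unescaped;
# the $ outside the group anchors all three alternatives to end of line.
trimmed=$(printf '%s\n' "$seqs" | sed -E 's/(TAG|TAA|TGA)$//')
printf '%s\n' "$trimmed"
```

Because of the $ anchor, the TAG occurrences inside the sequences are left alone; only the trailing three letters of each line are removed.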