I have a file from which I want to delete every line matching a pattern, together with the lines directly above and below it.
For example:
FFFFIFIBBFFFFFFFFFFFFFBBBBFBBBBFBBBB77<<BBBBBB7B<BBBBBB<B<
@HISEQ:102:h9u5badxx:1:1101:13002:2147 1:N:0:CTGT
GATCCCCGTCTATCAGATACACGTTACTCAGCTAGTGCGAATGCGAACGCGAAATTTT
+
FFFFFFFFBBFFFFFFFFFFFFFBFBFFFFFFFFFBFFFBFFFFFBFFFFFFFFFBFB
@HISEQ:102:h9u5badxx:1:1101:15368:2194 1:N:0:CTGT
+
FFIFBFFIFFBBBFFFFFFFBBFFBFFBBBFFFBB7BBBBBBFFFBB700<7770<BBB0<0<BFFBFBFFFFF
@HISEQ:102:h9u5badxx:1:1101:19167:2169 1:N:0:CTGT
GATCTCATATAGGGCAGCGTGGTCGCGGC
I want to remove the second block, which does not contain a nucleotide sequence.
The end result:
`FFFFIFIBBFFFFFFFFFFFFFBBBBFBBBBFBBBB77<<BBBBBB7B<BBBBBB<B<
@HISEQ:102:h9u5badxx:1:1101:13002:2147 1:N:0:CTGT
GATCCCCGTCTATCAGATACACGTTACTCAGCTAGTGCGAATGCGAACGCGAAATTTT
+
FFIFBFFIFFBBBFFFFFFFBBFFBFFBBBFFFBB7BBBBBBFFFBB700<7770<BBB0<0<BFFBFBFFFFF
@HISEQ:102:h9u5badxx:1:1101:19167:2169 1:N:0:CTGT
GATCTCATATAGGGCAGCGTGGTCGCGGC
`
The pattern that matches this block,
'^.+$(\n)^(@HISEQ).*$(\n)^\+'
works in Perl and JavaScript, but not in sed, because sed reads its input line by line, so by default a pattern never sees a line break.
I found this workaround:
sed -e ':a;N;$!ba;s/\n/ /' test
But this code replaces a line break with a space. If I insert my regexp into it:
sed -e ':a;N;$!ba;/^.+$(\n)^(@HISEQ).*$(\n)^\+/d' test
it does not work. Can you help me find a solution to this problem?
I'm just stupid. I misunderstood the file format. Input:
@HWI-ST383:199:D1L73ACXX:3:1101:1309:1956 1:N:0:ACAGTGA
+
JJJHIIJFIJJJJ=BFFFFFEEEEEEDDDDDDDDDDBD
@HWI-ST383:199:D1L73ACXX:3:1101:3437:1952 1:N:0:ACAGTGA
GATCTCGAAGCAAGAGTACGACGAGTCGGGCCCCTCCA
+
IIIIFFF<?6?FAFEC@=C@1AE###############
How should I edit the regular expression to get the desired output?
Output:
@HWI-ST383:199:D1L73ACXX:3:1101:3437:1952 1:N:0:ACAGTGA
GATCTCGAAGCAAGAGTACGACGAGTCGGGCCCCTCCA
+
IIIIFFF<?6?FAFEC@=C@1AE###############
If I understand you correctly, then
sed ':loop; N; /\n+/ ! { $ ! b loop }; /\n@HISEQ[^\n]\+\n+/ d' foo.txt
will work with GNU sed. It works as follows:
:loop                       # in a loop
N                           # fetch more lines
/\n+/ ! { $ ! b loop }      # until one starts with + or is the last line
/\n@HISEQ[^\n]\+\n+/ d      # if the penultimate line of the collected block begins with @HISEQ,
                            # discard the lot.
The last pattern relies on the fact that it is checked right after the first line beginning with + is found, so the \n+ at its end can only match the start of the last line in the block.
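To see the loop in action, here is a small sketch that runs the same sed script on a shortened stand-in for the question's input. The function names drop_blocks and sample_input and the abbreviated read names are made up for illustration, and GNU sed is assumed (the \+ operator and one-line label syntax are GNU extensions).

```shell
# Wrap the command from the answer in a function so it can read from a pipe.
# drop_blocks is a hypothetical name; the sed script itself is unchanged.
drop_blocks() {
  sed ':loop; N; /\n+/ ! { $ ! b loop }; /\n@HISEQ[^\n]\+\n+/ d'
}

# Shortened stand-in for the question's input: the second @HISEQ header
# is followed directly by "+", i.e. its sequence line is missing.
sample_input() {
  printf '%s\n' \
    'FFFFIFIBB' \
    '@HISEQ:1:a 1:N:0:CTGT' \
    'GATCCCCGT' \
    '+' \
    'FFFFFFFFB' \
    '@HISEQ:1:b 1:N:0:CTGT' \
    '+' \
    'FFIFBFFIF' \
    '@HISEQ:1:c 1:N:0:CTGT' \
    'GATCTCATA'
}

sample_input | drop_blocks
```

With GNU sed this should print the seven lines that remain once the FFFFFFFFB / @HISEQ:1:b / + group has been discarded, which mirrors the end result asked for in the question.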
To remove the second block, you can split the input into records at each + and skip record 2:
awk 'NR!=2' RS=+ ORS=+ input
But I suspect you want something more like:
awk '/[GATC]{5,}\n/' RS=+ ORS=+ input
or
awk '/\n[GATC]*\n/' RS=+ ORS=+ input
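For the layout shown in the edited question (header, sequence, +, quality), a line-by-line state machine may be closer to what is wanted. The following is only a sketch under that assumption: keep_complete_records is a made-up name, the sample reads are shortened, and a record counts as "missing its sequence" when the line right after the header already starts with +.

```shell
# keep_complete_records is a hypothetical helper name.
# Assumed layout: header, [sequence], "+", quality. A record whose second
# line already starts with "+" has no sequence and is dropped entirely.
keep_complete_records() {
  awk '
    state == 0 { header = $0; state = 1; next }        # header line
    state == 1 {                                       # sequence line, or "+" if it is missing
      if (substr($0, 1, 1) == "+") { keep = 0; state = 3 }
      else { seq = $0; keep = 1; state = 2 }
      next
    }
    state == 2 { plus = $0; state = 3; next }          # separator line
    state == 3 {                                       # quality line ends the record
      if (keep) { print header; print seq; print plus; print $0 }
      state = 0
    }
  ' "$@"
}

# Shortened stand-in for the edited input: the first record has no sequence.
printf '%s\n' \
  '@HWI:1:a 1:N:0:ACAGTGA' \
  '+' \
  'JJJHIIJFIJ' \
  '@HWI:1:b 1:N:0:ACAGTGA' \
  'GATCTCGAAG' \
  '+' \
  'IIIIFFF<?6' |
  keep_complete_records
```

Because the script tracks its position within each record instead of searching for @ or +, a quality line that happens to begin with one of those characters is not mistaken for a header or separator. On a real file it would be called as keep_complete_records input.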