I want to use grep to select the lines that do not contain "WAT" from a UTF-8 encoded file of 425409 lines (26.8 MB).
The file looks like this:
>ATOM 1 N ALA 1 9.979 -15.619 28.204 1.00 0.00
>ATOM 2 H1 ALA 1 9.594 -15.053 28.938 1.00 0.00
>ATOM 3 H2 ALA 1 9.558 -15.358 27.323 1.00 0.00
>ATOM 12 O ALA 1 7.428 -16.246 28.335 1.00 0.00
>ATOM 13 N HID 2 7.563 -18.429 28.562 1.00 0.00
>ATOM 14 H HID 2 6.557 -18.369 28.638 1.00 0.00
>ATOM 15 CA HID 2 8.082 -19.800 28.535 1.00 0.00
>ATOM 24 HE1 HID 2 8.603 -23.670 33.041 1.00 0.00
>ATOM 25 NE2 HID 2 8.012 -23.749 30.962 1.00 0.00
>ATOM 29 O HID 2 5.854 -20.687 28.537 1.00 0.00
>ATOM 30 N GLN 3 7.209 -21.407 26.887 1.00 0.00
>ATOM 31 H GLN 3 8.168 -21.419 26.566 1.00 0.00
>ATOM 32 CA GLN 3 6.271 -22.274 26.157 1.00 0.00
**16443 lines**
>ATOM 16425 C116 PA 1089 -34.635 6.968 -0.185 1.00 0.00
>ATOM 16426 H16R PA 1089 -35.669 7.267 -0.368 1.00 0.00
>ATOM 16427 H16S PA 1089 -34.579 5.878 -0.218 1.00 0.00
>ATOM 16428 H16T PA 1089 -34.016 7.366 -0.990 1.00 0.00
>ATOM 16429 C115 PA 1089 -34.144 7.493 1.177 1.00 0.00
>ATOM 16430 H15R PA 1089 -33.101 7.198 1.305 1.00 0.00
>ATOM 16431 H15S PA 1089 -34.179 8.585 1.197 1.00 0.00
>ATOM 16432 C114 PA 1089 -34.971 6.910 2.342 1.00 0.00
>ATOM 16433 H14R PA 1089 -35.147 5.847 2.166 1.00 0.00
**132284 lines**
>ATOM 60981 O WAT 7952 -46.056 -5.515 -56.245 1.00 0.00
>ATOM 60982 H1 WAT 7952 -45.185 -5.238 -56.602 1.00 0.00
>ATOM 60983 H2 WAT 7952 -46.081 -6.445 -56.561 1.00 0.00
>TER
>ATOM 60984 O WAT 7953 -51.005 -3.205 -46.712 1.00 0.00
>ATOM 60985 H1 WAT 7953 -51.172 -3.159 -47.682 1.00 0.00
>ATOM 60986 H2 WAT 7953 -51.051 -4.177 -46.579 1.00 0.00
>TER
>ATOM 60987 O WAT 7954 -49.804 -0.759 -49.284 1.00 0.00
>ATOM 60988 H1 WAT 7954 -48.962 -0.677 -49.785 1.00 0.00
>ATOM 60989 H2 WAT 7954 -49.868 0.138 -48.903 1.00 0.00
**many lines until the end**
>TER
>END
I have used grep -v 'WAT' file.txt
but it only returned the first 16179 lines not containing "WAT", and I can see that there are more lines not containing "WAT". For instance, the following line (and many others) does not appear in the output:
> ATOM 16425 C116 PA 1089 -34.635 6.968 -0.185 1.00 0.00
To figure out what was happening, I tried grep ' ' file.txt
This command should return every line in the file, but it too returned only the first 16179 lines.
I also tried tail -408977 file.txt | grep ' '
and it returned all the lines produced by tail. Then I tried tail -408978 file.txt | grep ' '
and the output was completely empty: zero lines.
I am working on a "normal" 64-bit system, Kubuntu.
Thanks a lot for the help!
When I try it, I get:
$: grep WAT file.txt
Binary file file.txt matches
grep is assuming it's a binary file, which GNU grep does when the data contains NUL bytes. Add -a to make it treat the file as text:
-a, --text equivalent to --binary-files=text
$: grep -a WAT file.txt|head -3
ATOM 29305 O WAT 4060 -75.787 -79.125 25.925 1.00 0.00 O
ATOM 29306 H1 WAT 4060 -76.191 -78.230 25.936 1.00 0.00 H
ATOM 29307 H2 WAT 4060 -76.556 -79.670 25.684 1.00 0.00 H
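For the original goal of dropping the "WAT" lines, the same flag can be combined with -v. This is only a minimal sketch on a made-up five-line sample, not your real data; sample.txt and the variable kept_count are hypothetical names chosen for illustration.

```shell
# Build a tiny sample in which one line ends with two NUL bytes,
# mimicking the problem file (sample.txt is a hypothetical name).
printf 'ATOM 1 N ALA 1\nATOM 2 KA CAL 1085\000\000\nATOM 3 C116 PA 1089\nATOM 4 O WAT 7952\nTER\n' > sample.txt

# -a forces text mode, -v inverts the match, -c counts the selected lines.
kept_count=$(grep -a -v -c 'WAT' sample.txt)
echo "lines without WAT: $kept_count"
```

On the real file, grep -a -v 'WAT' file.txt should print the lines themselves; note that the NUL bytes are passed through to the output unchanged.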
Your file has two NUL bytes at the end of each of lines 16426, 16428, 16430, and 16432:
$: tr "\0" @ <file.txt|grep -n @
16426:ATOM 16421 KA CAL 1085 -20.614 -22.960 18.641 1.00 0.00 @@
16428:ATOM 16422 KA CAL 1086 20.249 21.546 19.443 1.00 0.00 @@
16430:ATOM 16423 KA CAL 1087 22.695 -19.700 19.624 1.00 0.00 @@
16432:ATOM 16424 KA CAL 1088 -22.147 19.317 17.966 1.00 0.00 @@
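If the NUL bytes are not wanted at all, one option is to delete them with tr before filtering, so that grep no longer sees binary data. The sketch below assumes the NULs carry no meaning; sample.txt, clean.txt, and no_water.txt are hypothetical file names, and the sample data is made up.

```shell
# Made-up sample with two NUL bytes at the end of one line.
printf 'ATOM 1 N ALA 1\nATOM 2 KA CAL 1085\000\000\nATOM 3 C116 PA 1089\nATOM 4 O WAT 7952\nTER\n' > sample.txt

# Count the NUL bytes: -c complements the set, so -cd deletes everything except NUL.
nul_count=$(tr -cd '\000' < sample.txt | wc -c)
echo "NUL bytes found: $nul_count"

# Delete every NUL byte, then drop the WAT lines from the cleaned copy.
tr -d '\000' < sample.txt > clean.txt
grep -v 'WAT' clean.txt > no_water.txt
cat no_water.txt
```

Writing to a separate cleaned copy keeps the original file untouched, which seems safer until you know where the NULs came from.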