
Grep not parsing the whole file

Tags: grep, bash, shell

I want to use grep to select the lines that do not contain "WAT" from a UTF-8 encoded file with 425409 lines (26.8 MB).

The file looks like this:

>ATOM      1 N    ALA     1       9.979 -15.619  28.204  1.00  0.00  
>ATOM      2 H1   ALA     1       9.594 -15.053  28.938  1.00  0.00  
>ATOM      3 H2   ALA     1       9.558 -15.358  27.323  1.00  0.00    
>ATOM     12 O    ALA     1       7.428 -16.246  28.335  1.00  0.00  
>ATOM     13 N    HID     2       7.563 -18.429  28.562  1.00  0.00  
>ATOM     14 H    HID     2       6.557 -18.369  28.638  1.00  0.00  
>ATOM     15 CA   HID     2       8.082 -19.800  28.535  1.00  0.00  
>ATOM     24 HE1  HID     2       8.603 -23.670  33.041  1.00  0.00  
>ATOM     25 NE2  HID     2       8.012 -23.749  30.962  1.00  0.00    
>ATOM     29 O    HID     2       5.854 -20.687  28.537  1.00  0.00  
>ATOM     30 N    GLN     3       7.209 -21.407  26.887  1.00  0.00  
>ATOM     31 H    GLN     3       8.168 -21.419  26.566  1.00  0.00  
>ATOM     32 CA   GLN     3       6.271 -22.274  26.157  1.00  0.00  

**16443 lines**  

>ATOM  16425 C116 PA   1089     -34.635   6.968  -0.185  1.00  0.00  
>ATOM  16426 H16R PA   1089     -35.669   7.267  -0.368  1.00  0.00  
>ATOM  16427 H16S PA   1089     -34.579   5.878  -0.218  1.00  0.00  
>ATOM  16428 H16T PA   1089     -34.016   7.366  -0.990  1.00  0.00  
>ATOM  16429 C115 PA   1089     -34.144   7.493   1.177  1.00  0.00  
>ATOM  16430 H15R PA   1089     -33.101   7.198   1.305  1.00  0.00  
>ATOM  16431 H15S PA   1089     -34.179   8.585   1.197  1.00  0.00  
>ATOM  16432 C114 PA   1089     -34.971   6.910   2.342  1.00  0.00  
>ATOM  16433 H14R PA   1089     -35.147   5.847   2.166  1.00  0.00  

**132284 lines**

>ATOM  60981 O    WAT  7952     -46.056  -5.515 -56.245  1.00  0.00  
>ATOM  60982 H1   WAT  7952     -45.185  -5.238 -56.602  1.00  0.00  
>ATOM  60983 H2   WAT  7952     -46.081  -6.445 -56.561  1.00  0.00  
>TER     
>ATOM  60984 O    WAT  7953     -51.005  -3.205 -46.712  1.00  0.00  
>ATOM  60985 H1   WAT  7953     -51.172  -3.159 -47.682  1.00  0.00  
>ATOM  60986 H2   WAT  7953     -51.051  -4.177 -46.579  1.00  0.00  
>TER     
>ATOM  60987 O    WAT  7954     -49.804  -0.759 -49.284  1.00  0.00  
>ATOM  60988 H1   WAT  7954     -48.962  -0.677 -49.785  1.00  0.00  
>ATOM  60989 H2   WAT  7954     -49.868   0.138 -48.903  1.00  0.00

**many lines until the end** 

>TER
>END

I ran `grep -v 'WAT' file.txt`, but it returned only the first 16179 lines that do not contain "WAT", even though I can see that the file has more such lines. For instance, the following line (and many others) does not appear in the output:

> ATOM  16425 C116 PA   1089     -34.635   6.968  -0.185  1.00  0.00

To figure out what was happening, I tried `grep ' ' file.txt`. This command should return every line in the file, but it also returned only the first 16179 lines. I also tried `tail -408977 file.txt | grep ' '`, which returned every line that tail passed along. Then I tried `tail -408978 file.txt | grep ' '`, and the output was completely empty: zero lines.

I am working on a "normal" 64-bit Kubuntu system. Thanks a lot for the help!

Gerardo Zerbetto De Palma asked Oct 16 '25 18:10


1 Answer

When I try it, I get:

$: grep WAT file.txt
Binary file file.txt matches

grep is treating the file as binary. Add `-a` to force it to process the file as text:

-a, --text equivalent to --binary-files=text

$: grep -a WAT file.txt|head -3
ATOM  29305 O    WAT  4060     -75.787 -79.125  25.925  1.00  0.00           O
ATOM  29306 H1   WAT  4060     -76.191 -78.230  25.936  1.00  0.00           H
ATOM  29307 H2   WAT  4060     -76.556 -79.670  25.684  1.00  0.00           H

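Your file has two NUL bytes at the end of each of lines 16426, 16428, 16430, and 16432, which is why grep classifies it as binary. You can make them visible by translating each NUL to @: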

$: tr "\0" @ <file.txt|grep -n @
16426:ATOM  16421 KA   CAL  1085     -20.614 -22.960  18.641  1.00  0.00          @@
16428:ATOM  16422 KA   CAL  1086      20.249  21.546  19.443  1.00  0.00          @@
16430:ATOM  16423 KA   CAL  1087      22.695 -19.700  19.624  1.00  0.00          @@
16432:ATOM  16424 KA   CAL  1088     -22.147  19.317  17.966  1.00  0.00          @@
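
If the NUL bytes are not meant to be in the file, another option is to delete them before grep sees the data, so `-a` is no longer needed. This is only a sketch under that assumption, again run on a made-up sample file (the names `sample`, `nul_before`, and `cleaned` are hypothetical):

```shell
# Hypothetical sample with two NUL bytes at the end of the second line.
sample=$(mktemp)
printf 'ATOM 1 N ALA 1\nATOM 2 KA CAL 1085\0\0\nATOM 3 O WAT 7952\nATOM 4 C PA 1089\n' > "$sample"

# Count the NUL bytes: tr -cd deletes everything except NUL, wc -c counts what is left.
nul_before=$(tr -cd '\0' < "$sample" | wc -c)
echo "NUL bytes found: $nul_before"

# Delete the NUL bytes, then filter out the WAT lines with a plain grep -v.
cleaned=$(tr -d '\0' < "$sample" | grep -v 'WAT')
printf '%s\n' "$cleaned"

rm -f "$sample"
```

For the real file this would be `tr -d '\0' < file.txt | grep -v 'WAT' > filtered.txt`, where `filtered.txt` is just a placeholder output name.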
Paul Hodges answered Oct 19 '25 10:10