
Extract column using grep

I have a data frame with >100 columns each labeled with a unique string. Column 1 represents the index variable. I would like to use a basic UNIX command to extract the index column (column 1) + a specific column string using grep.

For example, if my data frame looks like the following:

Index  A  B  C...D  E  F
p1     1  7  4   2  5  6
p2     2  2  1   2  .  3
p3     3  3  1   5  6  1

I would like to use some command to extract only column "X" which I will specify with grep, and display both column 1 & the column I grep'd. I know that I can use cut -f1 myfile for the first bit, but need help with the grep per column. As a more concrete example, if my grep phrase were "B", I would like the output to be:

Index  B
p1     7
p2     2
p3     3

I am new to UNIX, and have not found much in similar examples. Any help would be much appreciated!!

AMS asked Sep 17 '16 20:09



2 Answers

You need to use awk:

awk '{print $1,$3}' <namefile> 

This command prints the first ($1) and third ($3) columns of the file; in your example the third column is B. awk can do much more than this, so its man page is worth a look.

A nice combination is grep and awk joined by a pipe. The following prints columns 1 and 3 of only those lines of your file that contain 'p1':

grep 'p1' <namefile> | awk '{print $1,$3}' 

If, instead, you want to select lines by line number, you can replace grep with sed. The -n option suppresses sed's default output, so only line 1 is printed:

sed -n 1p <namefile> | awk '{print $1,$3}'

Actually, awk can be used alone in all the examples:

awk '/p1/{print $1,$3}' <namefile>       # prints only lines containing p1
awk 'NR == 1 {print $1,$3}' <namefile>   # prints only the first line
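The commands above select the column by position. To select it by header name instead, awk can look the name up on the first line. The following is a minimal sketch, assuming whitespace-separated fields and an exact match on the header; the file name data.txt and the helper extract_col are hypothetical names used only for illustration.

```shell
# Hypothetical sample file laid out like the example in the question
printf 'Index A B C\np1 1 7 4\np2 2 2 1\np3 3 3 1\n' > data.txt

# extract_col NAME FILE: print column 1 plus the column whose header is NAME
extract_col() {
  awk -v col="$1" '
    # On the header line, remember the field number whose name equals col
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i
              if (!c) { print "column not found: " col > "/dev/stderr"; exit 1 } }
    # For every line (header included), print field 1 and the matched field
    { print $1, $c }
  ' "$2"
}

extract_col B data.txt
```

Because the comparison is $i == col rather than a regular expression, a header such as AB is not mistaken for B.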
Riccardo Petraglia answered Oct 13 '22 02:10


First figure out the command to find the column number.

columnname=C
sed -n "1 s/${columnname}.*//p" datafile | sed 's/[^\t*]//g' | wc -c

Once you know the number, use cut

cut -f1,3 < datafile  

Combine into one command

cut -f1,$(sed -n "1 s/${columnname}.*//p" datafile |
    sed 's/[^\t*]//g' | wc -c) < datafile

Finished? Not quite. If one header can be a substring of another, the first sed command may match at the wrong header and give the wrong column number. To fix that, include the surrounding tabs in your match and put the tabs back in the replacement string.
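One way around the substring problem, sketched here with a different technique from the sed edit suggested above, is to put each header on its own line and let grep report the number of the line that matches exactly. This assumes a tab-separated file with unique headers; the sample data and the variable colnum are hypothetical.

```shell
# Hypothetical tab-separated sample in which the header "AB" contains "B"
printf 'Index\tAB\tB\tC\np1\t1\t7\t4\np2\t2\t2\t1\np3\t3\t3\t1\n' > datafile

columnname=B
# One header per line; grep -n prints the line number, -x requires a
# whole-line match, -F treats the name as a fixed string, not a regex.
colnum=$(head -n 1 datafile | tr '\t' '\n' | grep -nxF -- "$columnname" | cut -d: -f1)

# Column 1 plus the matched column
cut -f1,"$colnum" datafile
```

With this sample, the lookup skips AB and selects the third column, so cut prints the Index and B columns.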

Walter A answered Oct 13 '22 01:10