I have a data frame with >100 columns, each labeled with a unique string. Column 1 is the index variable. I would like to use a basic UNIX command to extract the index column (column 1) plus one specific column, selected by its header string with grep.
For example, if my data frame looks like the following:
Index A B C...D E F
p1 1 7 4 2 5 6
p2 2 2 1 2 . 3
p3 3 3 1 5 6 1
I would like to use some command to extract only column "X", which I will specify with grep, and display both column 1 and the column I grep'd. I know that I can use cut -f1 myfile for the first bit, but need help with the grep per column. As a more concrete example, if my grep phrase were "B", I would like the output to be:
Index B
p1 7
p2 2
p3 3
I am new to UNIX, and have not found much in similar examples. Any help would be much appreciated!!
If applicable, consider anchoring the pattern with a caret: grep -E '^foo|^bar' matches foo or bar only at the beginning of a line. Because column 1 always sits at the beginning of the line, this restricts the match to the index column. In line-based tools such as grep, ^ matches the starting position of every line.
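As a small illustration of the anchored pattern, here is a sketch on a hypothetical tab-separated file modeled on the question's table (the file contents and the names datafile and rows are assumptions made for the example). Note that this selects rows by what column 1 starts with, it does not pick out columns, and ^p1 would also match a label such as p10.

```shell
# Hypothetical sample file modeled on the question's table (tab-separated).
datafile=$(mktemp)
printf 'Index\tA\tB\np1\t1\t7\np2\t2\t2\np3\t3\t3\n' > "$datafile"

# Keep only the rows whose first column starts with p1 or p3.
rows=$(grep -E '^p1|^p3' "$datafile")
printf '%s\n' "$rows"

rm -f "$datafile"
```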
To search multiple files with the grep command, insert the filenames you want to search, separated with a space character. The terminal prints the name of every file that contains the matching lines, and the actual lines that include the required string of characters. You can append as many filenames as needed.
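A minimal sketch of searching several files at once, using two throwaway files whose names (a.tsv, b.tsv) and contents are invented purely for illustration:

```shell
# Two hypothetical files created only for this illustration.
dir=$(mktemp -d)
printf 'p1\t1\t7\np2\t2\t2\n' > "$dir/a.tsv"
printf 'p3\t3\t3\np1\t9\t9\n' > "$dir/b.tsv"

# With more than one file, grep prefixes each matching line with its file name.
matches=$(cd "$dir" && grep '^p1' a.tsv b.tsv)
printf '%s\n' "$matches"

rm -r "$dir"
```

Each output line has the form filename:matching-line, which is how you can tell which file a hit came from.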
You need to use awk:
awk '{print $1,$3}' <namefile>
This simple command prints the first ($1) and third ($3) columns of the file. awk is actually much more powerful than this, so it is worth having a look at its man page.
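To make this concrete, here is the same awk command run on a hypothetical tab-separated file modeled on the question's table (the sample data and the names datafile and out are assumptions). It relies on knowing in advance that the wanted column is the third one.

```shell
# Hypothetical sample file modeled on the question's table (tab-separated).
datafile=$(mktemp)
printf 'Index\tA\tB\np1\t1\t7\np2\t2\t2\np3\t3\t3\n' > "$datafile"

# awk splits each line on whitespace by default; $1 and $3 are the first and third fields.
# The comma in print inserts the output field separator (a single space by default).
out=$(awk '{print $1,$3}' "$datafile")
printf '%s\n' "$out"

rm -f "$datafile"
```

On this sample the result is the header line "Index B" followed by one "label value" line per row, which is the shape of output the question asks for.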
A nice combo is using grep and awk with a pipe. The following code will print columns 1 and 3 of only those lines of your file that contain 'p1':
grep 'p1' <namefile> | awk '{print $1,$3}'
If, instead, you want to select lines by line number you can replace grep with sed:
sed -n 1p <namefile> | awk '{print $1,$3}'
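One detail worth checking when selecting by line number: sed prints every input line by default, so the -n option is what restricts the output to the addressed line. The sketch below shows the difference on a hypothetical three-line file (sample data and variable names are assumptions).

```shell
# Hypothetical sample file: one header line and two data rows.
datafile=$(mktemp)
printf 'Index\tA\tB\np1\t1\t7\np2\t2\t2\n' > "$datafile"

# Without -n, sed echoes every line and the p command prints line 1 a second time,
# so the line count is the 3 input lines plus 1 duplicate.
without_n=$(sed '1p' "$datafile" | wc -l)

# With -n, only the line selected by the address reaches awk.
with_n=$(sed -n '1p' "$datafile" | awk '{print $1,$3}')

printf 'lines without -n: %d\n' "$without_n"
printf 'output with -n: %s\n' "$with_n"

rm -f "$datafile"
```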
Actually, awk can be used alone in all the examples:
awk '/p1/{print $1,$3}' <namefile> # prints only lines containing p1
awk 'NR == 1 {print $1,$3}' <namefile> # prints only the first line
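Since awk can do everything alone, it can also look up the column by its header name instead of a hard-coded number, which is what the question asks for. The following is only a sketch under stated assumptions: the file is tab-separated, header names are unique, and pick_column is a hypothetical helper name, not a standard command.

```shell
# Hypothetical sample file modeled on the question's table (tab-separated).
datafile=$(mktemp)
printf 'Index\tA\tB\tC\np1\t1\t7\t4\np2\t2\t2\t1\np3\t3\t3\t1\n' > "$datafile"

# Hypothetical helper: print column 1 plus the column whose header equals $1.
# Usage: pick_column HEADER FILE
pick_column() {
    awk -F '\t' -v name="$1" '
        NR == 1 {
            # Scan the header once for an exact (not substring) match.
            for (i = 1; i <= NF; i++) if ($i == name) col = i
            if (!col) { print "no such column: " name > "/dev/stderr"; exit 1 }
        }
        # Runs for the header line and for every data row.
        { print $1, $col }
    ' "$2"
}

out=$(pick_column B "$datafile")
printf '%s\n' "$out"

rm -f "$datafile"
```

Comparing with == rather than a regular expression means a header such as AB is not mistaken for B.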
First figure out the command to find the column number.
columnname=B
sed -n "1 s/${columnname}.*//p" datafile | sed 's/[^\t]*//g' | wc -c
Once you know the number, use cut
cut -f1,3 < datafile
Combine into one command
cut -f1,$(sed -n "1 s/${columnname}.*//p" datafile | sed 's/[^\t]*//g' | wc -c) < datafile
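To see why the pipeline above yields the column number, it can be unpacked into separate steps. This sketch runs on a hypothetical tab-separated file and makes two labeled substitutions: it counts tabs with tr -cd instead of the second sed (some sed implementations may not understand \t), and it adds the 1 explicitly instead of relying on wc -c counting the trailing newline. The variable names prefix, tabs, col and out are assumptions.

```shell
# Hypothetical sample file modeled on the question's table (tab-separated).
datafile=$(mktemp)
printf 'Index\tA\tB\tC\np1\t1\t7\t4\np2\t2\t2\t1\n' > "$datafile"
columnname=B

# Step 1: on line 1 only, delete everything from the column name onward.
prefix=$(sed -n "1 s/${columnname}.*//p" "$datafile")   # leaves "Index<TAB>A<TAB>"

# Step 2: count the tabs left over; each one marks a column before the target.
tabs=$(printf '%s' "$prefix" | tr -cd '\t' | wc -c)

# Step 3: the target column number is one more than the number of tabs before it.
col=$((tabs + 1))

out=$(cut -f1,"$col" "$datafile")
printf '%s\n' "$out"

rm -f "$datafile"
```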
Finished? No: when one header can be a substring of another header, you should improve the first sed command. Include the tabs in your match and put the tabs back in the replacement string.
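One possible way to sidestep the substring problem, sketched here with grep rather than the sed fix described above (so this is a different technique, not the answer's own): put each header on its own line and ask grep for the line number of the exact match. The sample file deliberately contains a header AB next to B; the file contents and variable names are assumptions.

```shell
# Hypothetical sample file where one header (AB) contains another (B).
datafile=$(mktemp)
printf 'Index\tAB\tB\tC\np1\t1\t7\t4\np2\t2\t2\t1\n' > "$datafile"
columnname=B

# Put each header on its own line, then let grep report the line number of the
# header that equals the name exactly (-x whole line, -F fixed string, -n number).
col=$(head -n 1 "$datafile" | tr '\t' '\n' | grep -n -x -F -e "$columnname" | cut -d: -f1)

# The line number of the header is the column number cut needs.
out=$(cut -f1,"$col" "$datafile")
printf '%s\n' "$out"

rm -f "$datafile"
```

If no header matches, col ends up empty and cut fails, so a real script would want to check for that case before calling cut.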