search for multiple conditioned string with awk

Tags: linux, unix, sed, awk

I have a set of data that looks like this:

col1    col2    col3    col4
ABC1    DEF1    GHI1    cond1
ABC2    DEF2    GHI2    cond1-cond2
ABC3    DEF3    GHI3    cond2/cond1
ABC4    DEF4    GHI4    cond2 cond1
ABC5    DEF5    GHI5    cond4/cond1
ABC6    DEF6    GHI6    cond1
ABC7    DEF7    GHI7    mcond1
ABC8    DEF8    GHI8    cond2
ABC9    DEF9    GHI9    cond1 at 50
ABCa    DEFa    GHIa    con cond1
ABCb    DEFb    GHIb    no. cond1
ABCc    DEFc    GHIc    cond1 something
ABCc    DEFc    GHIc    Mcond1

I am trying to write a couple of commands to separate the data according to the "col4" to get:

  1. A file that contains the string "cond1" with ANYTHING before or after it EXCEPT for the "M" and "m" letters. I am currently using this command (which doesn't exclude M and m, obviously) for this:

    awk 'BEGIN{IGNORECASE=1} $4 ~ /.cond1/ || $4 ~ /cond1./ ' /filepath.tab
    

    Please also note that the 4th, 9th and 11th rows of data contain spaces between "cond1" and the other strings; I want these INCLUDED in the file. (Does the "." in the command match spaces?)

  2. A file that contains "cond1" ONLY in "col4". I wrote this command for it:

    awk 'BEGIN{IGNORECASE=1} $4 ~ /^cond1$/ '  /filepath.tab
    

    It works fine for most of the data, but I am getting these 2 unwanted rows in the results:

    ABC9    DEF9    GHI9    cond1 at 50
    ABCc    DEFc    GHIc    cond1 something
    

Can anyone please help me with these 2 commands? Thanks.

I am looking for an output of 3 files that do not contain any duplicates. Here's an example:

File1 (containing all the combinations with cond1):

col1    col2    col3    col4
ABC2    DEF2    GHI2    cond1-cond2
ABC3    DEF3    GHI3    cond2/cond1
ABC4    DEF4    GHI4    cond2 cond1
ABC5    DEF5    GHI5    cond4/cond1
ABC9    DEF9    GHI9    cond1 at 50
ABCa    DEFa    GHIa    con cond1
ABCb    DEFb    GHIb    no. cond1
ABCc    DEFc    GHIc    cond1 something

File2 (containing cond1 alone):

col1    col2    col3    col4
ABC1    DEF1    GHI1    cond1
ABC6    DEF6    GHI6    cond1

File3 (contains anything that wasn't included in file1 and file2):

col1    col2    col3    col4
ABC7    DEF7    GHI7    mcond1
ABC8    DEF8    GHI8    cond2
ABCc    DEFc    GHIc    Mcond1

Note that the total number of data rows in the 3 files is the same as in the original file. I am sorry if this sounds rather complicated, but this is the case I actually have.

Asked by Error404 on Jan 04 '13 at 10:01


1 Answer

The problem is that column 4 contains spaces and your file isn't formatted the way you expect. Try awk -F'\t' '{print $4}' file and notice that the output is probably not what you expect.
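To make the splitting problem concrete, here is a small sketch on a single row copied from the question. It assumes the columns are separated by runs of spaces rather than tabs; the variable name `row` is only for illustration.

```shell
# One sample row, with columns separated by runs of spaces
row='ABC9    DEF9    GHI9    cond1 at 50'

# Default splitting treats every run of whitespace as a separator,
# so the row has 6 fields and $4 holds only the first word of col4
printf '%s\n' "$row" | awk '{ print NF ": " $4 }'
# prints "6: cond1"

# With a tab separator and no tabs in the row, the whole line is one field
printf '%s\n' "$row" | awk -F'\t' '{ print NF }'
# prints "1"
```

This is why /^cond1$/ also matched the "cond1 at 50" and "cond1 something" rows: with default splitting, awk only ever saw "cond1" in $4.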

First, convert the file to CSV format with sed:

$ sed 's/\s\+/,/;s/\s\+/,/;s/\s\+/,/;' file
col1,col2,col3,col4
ABC1,DEF1,GHI1,cond1
ABC2,DEF2,GHI2,cond1-cond2
ABC3,DEF3,GHI3,cond2/cond1
ABC4,DEF4,GHI4,cond2 cond1
ABC5,DEF5,GHI5,cond4/cond1
ABC6,DEF6,GHI6,cond1
ABC7,DEF7,GHI7,mcond1
ABC8,DEF8,GHI8,cond2
ABC9,DEF9,GHI9,cond1 at 50
ABCa,DEFa,GHIa,con cond1
ABCb,DEFb,GHIb,no. cond1
ABCc,DEFc,GHIc,cond1 something
ABCc,DEFc,GHIc,Mcond1

This converts only the first 3 whitespace groups to commas, leaving col4 as you want it (note that \s and \+ are GNU sed extensions). To store the change in the file, use the -i option:

$ sed -i 's/\s\+/,/;s/\s\+/,/;s/\s\+/,/;' file
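To see why only the first three gaps are converted: each s command without the g flag replaces just the first match left on the line. A quick sketch on one row from the question, assuming GNU sed (for \s and \+):

```shell
row='ABC4    DEF4    GHI4    cond2 cond1'

# Three substitutions without the g flag: each one replaces only the
# first remaining run of whitespace, so the space inside col4 survives
printf '%s\n' "$row" | sed 's/\s\+/,/;s/\s\+/,/;s/\s\+/,/'
# prints "ABC4,DEF4,GHI4,cond2 cond1"

# For comparison, the g flag would also split col4 itself
printf '%s\n' "$row" | sed 's/\s\+/,/g'
# prints "ABC4,DEF4,GHI4,cond2,cond1"
```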

Now a comma can be used as the field separator:

$ awk 'BEGIN{FS=",";OFS="\t"} NR==1{print $1,$2,$3,$4} $4~/cond1./ || $4~/[^mM]cond1/{print $1,$2,$3,$4}' file
col1    col2    col3    col4
ABC2    DEF2    GHI2    cond1-cond2
ABC3    DEF3    GHI3    cond2/cond1
ABC4    DEF4    GHI4    cond2 cond1
ABC5    DEF5    GHI5    cond4/cond1
ABC9    DEF9    GHI9    cond1 at 50
ABCa    DEFa    GHIa    con cond1
ABCb    DEFb    GHIb    no. cond1
ABCc    DEFc    GHIc    cond1 something
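On the side question of whether "." matches a space: in awk regular expressions "." matches any single character, spaces included, which is why a value like "cond1 at 50" is picked up by /cond1./. Also note that /[^mM]cond1/ needs some character in front of cond1, so on its own it cannot match a value that starts with cond1; that is why the two patterns are combined with ||. A small sketch on a few col4 values taken from the question:

```shell
# Feed a few col4 values, one per line, and report which pattern matches
printf '%s\n' 'cond1 at 50' 'con cond1' 'mcond1' 'cond1' |
awk '{
    dot  = ($0 ~ /cond1./)     ? "yes" : "no"   # "." matches any character, a space included
    notm = ($0 ~ /[^mM]cond1/) ? "yes" : "no"   # needs a character other than m or M before cond1
    print $0 " | cond1.: " dot " | [^mM]cond1: " notm
}'
```

Only the first value matches /cond1./ and only the second matches /[^mM]cond1/; "mcond1" and a bare "cond1" match neither. One caveat with this sample data in mind: a value such as "Mcond1 something" would still match /cond1./, so the combined test is only as strict as the data allows.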

$ awk 'BEGIN{FS=",";OFS="\t"} NR==1{print $1,$2,$3,$4} $4~/^cond1$/{print $1,$2,$3,$4}' file 
col1    col2    col3    col4
ABC1    DEF1    GHI1    cond1
ABC6    DEF6    GHI6    cond1

$ awk 'BEGIN{FS=",";OFS="\t"} $4!~/cond1/ || $4~/[Mm]cond1/{print $1,$2,$3,$4}' file
col1    col2    col3    col4
ABC7    DEF7    GHI7    mcond1
ABC8    DEF8    GHI8    cond2
ABCc    DEFc    GHIc    Mcond1

To save each result to a new file, just redirect the output: awk '...' file > file1.
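If running three separate commands is inconvenient, the same split can probably be done in one pass. This is only a sketch under a few assumptions, not a replacement for the commands above: it works on comma-separated input as produced by the sed step, the names sample.csv, file1, file2 and file3 are placeholders, and it swaps in a slightly different pattern, (^|[^mM])cond1, so that every row lands in exactly one file. It builds a small sample in a temporary directory so it can be run as is.

```shell
# Work in a scratch directory so no existing files are overwritten
cd "$(mktemp -d)" || exit 1

# Small comma-separated sample with the same shape as the sed output
cat > sample.csv <<'EOF'
col1,col2,col3,col4
ABC1,DEF1,GHI1,cond1
ABC2,DEF2,GHI2,cond1-cond2
ABC7,DEF7,GHI7,mcond1
ABC8,DEF8,GHI8,cond2
ABC9,DEF9,GHI9,cond1 at 50
ABCa,DEFa,GHIa,con cond1
ABCc,DEFc,GHIc,Mcond1
EOF

awk 'BEGIN { FS = ","; OFS = "\t" }
{ $1 = $1 }                  # rebuild the record so it is written tab-separated
NR == 1 {                    # copy the header into all three files
    print > "file1"; print > "file2"; print > "file3"; next
}
$4 == "cond1"         { print > "file2"; next }   # cond1 alone
$4 ~ /(^|[^mM])cond1/ { print > "file1"; next }   # cond1 not directly preceded by m or M
                      { print > "file3" }         # everything else
' sample.csv
```

Because each rule ends with next, a row is written to the first file whose test it passes and never to a second one, which keeps the total row count equal to the input.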

Answered by Chris Seymour on Oct 16 '22 at 06:10