I have a set of data that looks like this
col1 col2 col3 col4
ABC1 DEF1 GHI1 cond1
ABC2 DEF2 GHI2 cond1-cond2
ABC3 DEF3 GHI3 cond2/cond1
ABC4 DEF4 GHI4 cond2 cond1
ABC5 DEF5 GHI5 cond4/cond1
ABC6 DEF6 GHI6 cond1
ABC7 DEF7 GHI7 mcond1
ABC8 DEF8 GHI8 cond2
ABC9 DEF9 GHI9 cond1 at 50
ABCa DEFa GHIa con cond1
ABCb DEFb GHIb no. cond1
ABCc DEFc GHIc cond1 something
ABCc DEFc GHIc Mcond1
I am trying to write a couple of commands to separate the data according to the "col4" to get:
A file that contains the string "cond1" with ANYTHING before or after it EXCEPT for the "M" and "m" letters. I am currently using this command (which doesn't exclude M and m, obviously) for this:
awk 'BEGIN{IGNORECASE=1} $4 ~ /.cond1/ || $4 ~ /cond1./ ' /filepath.tab
please also not that the 4th, 9th and the 11th rows of data contains spaces between "cond1" and other strings, I want this INCLUDED in the file. (does the "." in the command include spaces?)
I wrote a command to get a file that contains "cond1" ONLY in the "col4"
awk 'BEGIN{IGNORECASE=1} $4 ~ /^cond1$/ ' /filepath.tab
it is working fine with most of the data, though i am getting these 2 bugs in the results:
ABC9 DEF9 GHI9 cond1 at 50
ABCc DEFc GHIc cond1 something
Can anyone please help me with these 2 commands? Thanks.
I am looking for an output of 3 files that does not contain any duplicates, here's an example: File1 (containing all the combinations with cond1):
col1 col2 col3 col4
ABC2 DEF2 GHI2 cond1-cond2
ABC3 DEF3 GHI3 cond2/cond1
ABC4 DEF4 GHI4 cond2 cond1
ABC5 DEF5 GHI5 cond4/cond1
ABC9 DEF9 GHI9 cond1 at 50
ABCa DEFa GHIa con cond1
ABCb DEFb GHIb no. cond1
ABCc DEFc GHIc cond1 something
File2 (containing cond1 alone):
col1 col2 col3 col4
ABC1 DEF1 GHI1 cond1
ABC6 DEF6 GHI6 cond1
File3 (contains anything that wasn't included in file1 and file2):
col1 col2 col3 col4
ABC7 DEF7 GHI7 mcond1
ABC8 DEF8 GHI8 cond2
ABCc DEFc GHIc Mcond1
Note that the total rows of the 3 files is the same as the original file I am sorry if this sounds rather complicated, but this is the case I actually have.
The problem is that column 4 has spaces in and your file isn't formatted as expected try awk -F'\t' {print $4} file
and notice it's probably not what you expect.
First turn the file in CSV format with sed
$ sed 's/\s\+/,/;s/\s\+/,/;s/\s\+/,/;' file
col1,col2,col3,col4
ABC1,DEF1,GHI1,cond1
ABC2,DEF2,GHI2,cond1-cond2
ABC3,DEF3,GHI3,cond2/cond1
ABC4,DEF4,GHI4,cond2 cond1
ABC5,DEF5,GHI5,cond4/cond1
ABC6,DEF6,GHI6,cond1
ABC7,DEF7,GHI7,mcond1
ABC8,DEF8,GHI8,cond2
ABC9,DEF9,GHI9,cond1 at 50
ABCa,DEFa,GHIa,con cond1
ABCb,DEFb,GHIb,no. cond1
ABCc,DEFc,GHIc,cond1 something
ABCc,DEFc,GHIc,Mcond1
This only converts the first 3 whitespace groups to commas leaving col4 as you want. To store the change in the file use the -i
option:
sed -i 's/\s\+/,/;s/\s\+/,/;s/\s\+/,/;' file
Now the field separator ,
can be used.
awk 'BEGIN{FS=",";OFS="\t"} NR==1{print $1,$2,$3,$4} $4~/cond1./|| $4~/[^mM]cond1/{print $1,$2,$3,$4}' file
col1 col2 col3 col4
ABC2 DEF2 GHI2 cond1-cond2
ABC3 DEF3 GHI3 cond2/cond1
ABC4 DEF4 GHI4 cond2 cond1
ABC5 DEF5 GHI5 cond4/cond1
ABC9 DEF9 GHI9 cond1 at 50
ABCa DEFa GHIa con cond1
ABCb DEFb GHIb no. cond1
ABCc DEFc GHIc cond1 something
$ awk 'BEGIN{FS=",";OFS="\t"} NR==1{print $1,$2,$3,$4} $4~/^cond1$/{print $1,$2,$3,$4}' file
col1 col2 col3 col4
ABC1 DEF1 GHI1 cond1
ABC6 DEF6 GHI6 cond1
$ awk 'BEGIN{FS=",";OFS="\t"} $4!~/cond1/ || $4~/[Mm]cond1/{print $1,$2,$3,$4}' file
col1 col2 col3 col4
ABC7 DEF7 GHI7 mcond1
ABC8 DEF8 GHI8 cond2
ABCc DEFc GHIc Mcond1
Just redirect to save the newfiles awk '{...}' > file1
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With