I have data that looks like this (TAB delimited):
Organ K     ClustNo Analysis
LN    K200  C12     Gene Ontology
LN    K200  C116    Gene Ontology
CN    K200  C2      Gene Ontology
What I want to do is remove the leading C from the third column of every row except the header row:
Organ K     ClustNo Analysis
LN    K200  12      Gene Ontology
LN    K200  116     Gene Ontology
CN    K200  2       Gene Ontology
This won't do because it also affects other columns and the header row:
sed 's/C//'
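As a small sketch of the problem, the snippet below runs that command on the header and on one data row from the sample above. The variable names are only for illustration. Because s/C// deletes the first C it finds on each line, the header loses the C in ClustNo and the row loses the C in CN, while the third column is left alone.

```shell
# Two lines taken from the sample data above (tab-delimited).
header=$(printf 'Organ\tK\tClustNo\tAnalysis')
row=$(printf 'CN\tK200\tC2\tGene Ontology')

# s/C// removes the first C on each line, wherever it happens to be.
broken_header=$(printf '%s\n' "$header" | sed 's/C//')
broken_row=$(printf '%s\n' "$row" | sed 's/C//')

printf '%s\n%s\n' "$broken_header" "$broken_row"
```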
What's the right way to do it?
awk is a good tool for this:
$ awk -F'\t' -v OFS='\t' 'NR>=2{sub(/^C/, "", $3)} 1' file
Organ   K       ClustNo Analysis
LN      K200    12      Gene Ontology
LN      K200    116     Gene Ontology
CN      K200    2       Gene Ontology
-F'\t'
Use tab as the field delimiter on input.
-v OFS='\t'
Use tab as the field delimiter on output.
NR>=2 {sub(/^C/, "", $3)}
Remove the initial C from field 3 only for lines after the first line.
1
This is awk's cryptic shorthand for print-the-line.
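For anyone who wants to try this without the original data, here is a minimal sketch that builds a small tab-delimited file and runs the same program on it. The file name sample.tsv is made up for the example, and the trailing 1 is written out as an explicit { print } block, which should behave the same way.

```shell
# Build a small tab-delimited sample (sample.tsv is a hypothetical file name).
printf 'Organ\tK\tClustNo\tAnalysis\n'  > sample.tsv
printf 'LN\tK200\tC12\tGene Ontology\n' >> sample.tsv
printf 'CN\tK200\tC2\tGene Ontology\n'  >> sample.tsv

# From line 2 onward, strip a leading C from field 3; then print every line.
# { print } is the long form of the "1" shorthand.
awk_result=$(awk -F'\t' -v OFS='\t' 'NR>=2 { sub(/^C/, "", $3) } { print }' sample.tsv)

printf '%s\n' "$awk_result"
```

Because the pattern is anchored with ^ and applied to $3 only, the C in CN (first column) and the C in ClustNo (header) are left untouched.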
sed can do this as well:
$ sed -r '2,$ s/(([^\t]+\t){2})C/\1/' file
Organ   K       ClustNo Analysis
LN      K200    12      Gene Ontology
LN      K200    116     Gene Ontology
CN      K200    2       Gene Ontology
-r
Use extended regular expressions.  (On Mac OS X or other BSD platforms, use -E instead.)
2,$ s/(([^\t]+\t){2})C/\1/
This substitution is applied only for lines from 2 to the end of the file.
(([^\t]+\t){2}) matches the first two tab-separated columns.  This assumes that only one tab separates each column.  Because the regex is enclosed in parens, what it matches will be available later as \1.
C matches the literal C that begins the third column.
\1 replaces the matched text with just the first two columns, leaving out the C.
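A slightly more defensive variant is sketched below, under a few assumptions. \t inside a sed regex is a GNU extension that other sed implementations may not honor, so the sketch puts a literal tab into a shell variable instead. It uses -E, which both GNU and BSD sed accept. It also anchors the pattern with ^ so that a row whose third column does not start with C cannot match further to the right. The file name sample.tsv and the last data row are invented for the illustration.

```shell
# Hypothetical sample; the last row has no C in column 3 but has one in column 4.
printf 'Organ\tK\tClustNo\tAnalysis\n'  > sample.tsv
printf 'LN\tK200\tC12\tGene Ontology\n' >> sample.tsv
printf 'CN\tK200\t7\tCell Cycle\n'      >> sample.tsv

# A literal tab character, so the regex does not depend on \t support.
tab=$(printf '\t')

# Lines 2..$ only: match the first two tab-terminated fields from the start
# of the line, then a C, and put back just the two fields.
sed_result=$(sed -E "2,\$ s/^(([^$tab]+$tab){2})C/\\1/" sample.tsv)

printf '%s\n' "$sed_result"
```

With the anchor in place, the last row passes through unchanged instead of losing the C from Cell Cycle.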