I have a few columns in a file. The second column uses ":" as a delimiter, and I would like to remove the first, third and fourth strings in that column and keep only the second. But the file's normal delimiter is a space, so I have no idea how to do this.
input:
--- 22:16050075:A:G 16050075 A G
--- 22:16050115:G:A 16050115 G A
--- 22:16050213:C:T 16050213 C T
--- 22:16050319:C:T 16050319 C T
--- 22:16050527:C:A 16050527 C A
desired output:
--- 22 16050075 16050075 A G
--- 22 16050115 16050115 G A
--- 22 16050213 16050213 C T
--- 22 16050319 16050319 C T
--- 22 16050527 16050527 C A
Wrong:
cat df.txt | awk -F: '{print $1, $3, $6, $7, $8}'
--- 22 A
--- 22 G
--- 22 C
--- 22 C
--- 22 C
but I cannot get it right. Can awk or sed do this?
Thank you.
Just use the POSIX-compatible split() function on $2 as
awk '{split($2,temp,":"); $2=temp[2];}1' file
--- 16050075 16050075 A G
--- 16050115 16050115 G A
--- 16050213 16050213 C T
--- 16050319 16050319 C T
--- 16050527 16050527 C A
Split column 2 on the delimiter :, update $2 to the required element (temp[2]) and print the record. Assigning to $2 makes awk rebuild the record from its individual fields joined with OFS, and the trailing 1 is an always-true pattern that prints it.
I recommend this over using multiple delimiters, since doing so alters the absolute positions of the individual fields, while split() makes it easy to retain the positions and just extract the required value.
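To illustrate the point about field positions, here is a small sketch comparing the two approaches on one line from the question. The variable names (line, multi, keep) are just for this example, and the bracket expression '[ :]' assumes the columns are separated by exactly one space, as in the sample input.

```shell
# Sample line taken from the question's input.
line='--- 22:16050075:A:G 16050075 A G'

# Multi-delimiter approach: every ":" also starts a new field, so the
# trailing columns move from $3..$5 to $6..$8 and must be renumbered by hand.
multi=$(printf '%s\n' "$line" | awk -F'[ :]' '{print $1, $3, $6, $7, $8}')

# split() approach: only $2 is rewritten; all other fields keep their numbers.
keep=$(printf '%s\n' "$line" | awk '{split($2, temp, ":"); $2 = temp[2]} 1')

printf '%s\n%s\n' "$multi" "$keep"
```

Both commands print the same line here, but the first one has to be adjusted whenever the number of ":"-separated parts changes, while the second does not.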
For your updated requirement to add a new column, just do
awk '{split($2,temp,":"); $2=temp[1] FS temp[2];}1' file
--- 22 16050075 16050075 A G
--- 22 16050115 16050115 G A
--- 22 16050213 16050213 C T
--- 22 16050319 16050319 C T
--- 22 16050527 16050527 C A
Alternatively, if you have GNU awk (gawk), you can use its gensub() function for a regex-based extraction (using the POSIX character class [[:digit:]]) as
awk '{$2=gensub(/^([[:digit:]]+):([[:digit:]]+).*$/,"\\1 \\2","g",$2);}1' file
--- 22 16050075 16050075 A G
--- 22 16050115 16050115 G A
--- 22 16050213 16050213 C T
--- 22 16050319 16050319 C T
--- 22 16050527 16050527 C A
The gensub(/^([[:digit:]]+):([[:digit:]]+).*$/,"\\1 \\2","g",$2) part captures only the first two fields delimited by : in the capturing groups \\1 and \\2 and writes them back to $2 separated by a space; the rest of the fields are printed as they are.
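Since the question also asks about sed, the same capture-group idea can be sketched there too. This is only an illustration, not something I tested beyond the sample input: it assumes the first column contains no spaces, the columns are separated by a single space, and the first two ":"-separated parts are purely numeric. The variable names are made up for the example.

```shell
# Sample line taken from the question's input.
line='--- 22:16050075:A:G 16050075 A G'

# Basic regular expression, so it should work with any POSIX sed:
#   \1 = first column, \2 and \3 = first two ":"-separated parts of column 2;
#   the remaining parts of column 2 ([^ ]*) are dropped.
result=$(printf '%s\n' "$line" |
  sed 's/^\([^ ]*\) \([0-9]*\):\([0-9]*\):[^ ]*/\1 \2 \3/')

printf '%s\n' "$result"
```

Unlike the awk versions, this works on the whole line as text rather than on fields, so the pattern has to describe the first column explicitly.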