Remove strings by a specific delimiter

Tags: linux, bash, sed, awk

I have a few columns in a file. The second column uses ":" as a delimiter, and I would like to remove the first, third and fourth strings in that column, leaving only the second. The columns themselves are separated by spaces, so I am not sure how to handle both delimiters.

input:

--- 22:16050075:A:G 16050075 A G
--- 22:16050115:G:A 16050115 G A
--- 22:16050213:C:T 16050213 C T
--- 22:16050319:C:T 16050319 C T
--- 22:16050527:C:A 16050527 C A

desired output:

--- 22 16050075 16050075 A G
--- 22 16050115 16050115 G A
--- 22 16050213 16050213 C T
--- 22 16050319 16050319 C T
--- 22 16050527 16050527 C A

What I tried (gives the wrong output):
cat df.txt | awk -F: '{print $1, $3, $6, $7, $8}'

--- 22 A
--- 22 G
--- 22 C
--- 22 C
--- 22 C

But I cannot get it right. Can awk or sed do this?

Thank you.

asked Dec 14 '22 00:12 by Peter Chung


1 Answer

Just use the POSIX-compatible split() function on $2:

awk '{split($2,temp,":"); $2=temp[2];}1' file
--- 16050075 16050075 A G
--- 16050115 16050115 G A
--- 16050213 16050213 C T
--- 16050319 16050319 C T
--- 16050527 16050527 C A

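Split column 2 on the delimiter :, assign the required element (temp[2]) back to $2, and print the record. Assigning to a field makes awk rebuild $0 by joining all fields with OFS (a single space by default), and the trailing 1 is an always-true pattern whose default action prints the rebuilt record.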
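To see the rebuild step in isolation, here is a minimal sketch that sets OFS to a comma. The variable names input and out, and the choice of a comma, are only for illustration; the two rows come from the question's sample data.

```shell
# Two sample rows from the question, fed to awk on stdin
input='--- 22:16050075:A:G 16050075 A G
--- 22:16050115:G:A 16050115 G A'

# Assigning to $2 makes awk rebuild $0, joining the fields with OFS.
# With OFS set to a comma, the printed record is comma-separated.
out=$(printf '%s\n' "$input" | awk -v OFS=',' '{split($2,temp,":"); $2=temp[2]}1')
printf '%s\n' "$out"
```

The input was space-separated, yet the output uses commas throughout, which shows that the record is reassembled from its fields rather than edited in place.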

I recommend this over splitting on multiple delimiters (for example -F'[ :]'), because doing so shifts the absolute positions of all the following fields, while split() keeps the positions intact and just extracts the required value.
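For contrast, a rough sketch of the multiple-delimiter approach on one row of the sample input. The bracket expression [ :] as field separator and the variable names input and multi are assumptions made for this illustration.

```shell
input='--- 22:16050075:A:G 16050075 A G'

# Splitting on both space and colon turns the 5 original columns into 8 fields,
# so the trailing columns move from $3..$5 to $6..$8 and must be re-addressed.
multi=$(printf '%s\n' "$input" | awk -F'[ :]' '{print NF ": " $1, $2, $3, $6, $7, $8}')
printf '%s\n' "$multi"
```

This works for the sample, but the field numbers depend on how many colon-separated parts the second column has, which is why the split() version is easier to maintain.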


For your updated requirement to add a new column, just do

awk '{split($2,temp,":"); $2=temp[1] FS temp[2];}1' file
--- 22 16050075 16050075 A G
--- 22 16050115 16050115 G A
--- 22 16050213 16050213 C T
--- 22 16050319 16050319 C T
--- 22 16050527 16050527 C A

Alternatively, if you have GNU awk (gawk), you can use its gensub() function for a regex-based extraction (using the POSIX character class [[:digit:]]):

awk '{$2=gensub(/^([[:digit:]]+):([[:digit:]]+).*$/,"\\1 \\2","g",$2);}1' file
--- 22 16050075 16050075 A G
--- 22 16050115 16050115 G A
--- 22 16050213 16050213 C T
--- 22 16050319 16050319 C T
--- 22 16050527 16050527 C A

The gensub(/^([[:digit:]]+):([[:digit:]]+).*$/,"\\1 \\2","g",$2) call matches the first two :-delimited numeric parts of $2 with two capture groups and replaces the whole value with them (the back-references \\1 and \\2), separated by a space. The remaining fields are printed unchanged.
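Since the question also asks about sed, the same capture-group idea can be sketched there too. This assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed do) and that the columns are separated by spaces as in the sample; the variable names input and sed_out are illustrative.

```shell
input='--- 22:16050075:A:G 16050075 A G
--- 22:16050115:G:A 16050115 G A'

# Group 1: the first column plus the first numeric part of column 2.
# Group 2: the second numeric part. The rest of column 2 (":A:G") is dropped.
sed_out=$(printf '%s\n' "$input" | sed -E 's/^([^ ]+ +[0-9]+):([0-9]+):[^ ]*/\1 \2/')
printf '%s\n' "$sed_out"
```

Unlike the awk versions, this works on the whole line rather than on a single field, so the pattern has to describe the first column as well.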

answered Dec 27 '22 14:12 by Inian