Part of my four-column output looks like this:
5 cc1kcc1kc 5 cc1kcc1kc
5 cc2ppggg 5 cc2ppggg
6 ccg12qqqqqqqqqqqqggg 10 ccccg11qqqqqqqqqqqggggg
3 4qqqqcgc1q 12 cgccgccgccgc
I only want the second and fourth columns changed. Is there a way with awk/sed to remove the numbers along with the characters next to them? Or would it be easier or better to use a Perl script for this transformation?
The resulting output should look like this:
5 ccccc 5 ccccc
5 ccggg 5 ccggg
6 ccgggg 10 ccccgggggg
3 cgc 12 cgccgccgccgc
Taking the question literally, this removes each number n embedded in fields 2 and 4, together with the n characters that follow it.
perl -lane 'for $i (1, 3) {@nums = $F[$i] =~ /(\d+)/g; for $num (@nums) {$F[$i] =~ s/$num.{$num}//}}; print join("\t", @F)'
The other answers remove the number and the whole run of identical characters that follows it, however long that run is.
To illustrate the difference between my answer and the others, use the following input:
6 ccg8qqqqqqqqqqqqggg 10 ccccg3qqqqqqqqqqqggggg
My version outputs this:
6 ccgqqqqggg 10 ccccgqqqqqqqqggggg
while theirs output this:
6 ccgggg 10 ccccgggggg
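Since the question also asks about awk, here is a rough sketch of the same literal reading (delete each number n plus the n characters after it) in awk. It assumes an awk with the standard match(), RSTART and RLENGTH, and the function name strip_counted is made up for illustration. Unlike the Perl one-liner, it rescans the field after every deletion rather than collecting the numbers up front, and it joins the output with single spaces instead of tabs.

```sh
# strip_counted is a hypothetical helper name, not part of any tool.
# For fields 2 and 4: while the field still contains a number n,
# delete that number together with the n characters that follow it.
strip_counted() {
  awk '{
    for (i = 2; i <= 4; i += 2) {
      while (match($i, /[0-9]+/)) {
        n = substr($i, RSTART, RLENGTH) + 0
        $i = substr($i, 1, RSTART - 1) substr($i, RSTART + RLENGTH + n)
      }
    }
    print
  }'
}

echo '3 4qqqqcgc1q 12 cgccgccgccgc' | strip_counted
# prints: 3 cgc 12 cgccgccgccgc
```

Each pass removes at least the digits themselves, so the loop always terminates, even for a count of 0 or a count larger than what is left in the field.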
With perl:
perl -pe 's/\d+([^\d\s])\1*//g'
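The pattern reads as: one or more digits, then one character that is neither a digit nor whitespace, then any further repeats of that same character. Columns 1 and 3 survive only because their numbers are followed by whitespace, which the character class refuses to match. The same idea can probably be written in sed, which the question also asked about. This is a sketch that assumes a sed whose -E mode accepts back-references inside the pattern (GNU sed does; POSIX leaves it unspecified), and strip_runs is a made-up name.

```sh
# strip_runs is a hypothetical helper name, not part of any tool.
# Delete every number together with the run of identical
# non-digit, non-space characters that immediately follows it.
strip_runs() {
  sed -E 's/[0-9]+([^0-9[:space:]])\1*//g'
}

echo '3 4qqqqcgc1q 12 cgccgccgccgc' | strip_runs
# prints: 3 cgc 12 cgccgccgccgc
```

Like the Perl version, this works on the whole line rather than on fields 2 and 4 specifically, so it relies on the first and third columns being plain numbers followed by whitespace.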