I need to highlight every duplicate word in the text with the * symbol.
For example
lol foo lol bar foo bar
should be
lol foo *lol* bar *foo* *bar*
I tried with the following command:
echo "lol foo lol bar foo bar" | sed -r -e 's/(\b[a-zA-Z]+\b)([^*]+)(\1)/\1\2*\3*/'
It gives me:
lol foo *lol* bar foo bar
Then I added the g flag:
lol foo *lol* bar foo *bar*
But foo is not highlighted.
I know this happens because sed does not go back over text that an earlier match has already consumed.
Can I handle this with sed alone?
sed is not the best tool for this task. It doesn't support look-ahead, look-behind, or non-greedy quantifiers, but give the following command a try:
sed -r -e ':a ; s/\b([a-zA-Z]+)\b(.*) (\1)( |$)/\1\2 *\3*\4/ ; ta'
It uses conditional branching to repeat the substitution command until it fails. Also, you cannot use ([^*]+), because on the second round the match has to cross some of the * characters inserted by the first substitution; your only option is a greedy .* instead. Finally, you cannot match (\1) on its own, because it would match the first string lol again and again. You need some context, such as surrounding spaces or the end of the line.
The command yields:
lol foo *lol* bar *foo* *bar*
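To make the loop easier to follow, here is a minimal sketch of the same idea spread over several lines with comments. It assumes GNU sed (-E is equivalent to -r, and \b is a GNU extension). The function name highlight_dups is only a placeholder for this example, and this variant writes back the captured trailing separator (\4) rather than a literal space, so nothing extra is appended at the end of the line.

```shell
#!/bin/sh
# Assumes GNU sed: -E (same as -r) enables extended regexes; \b is a GNU extension.
# highlight_dups is a hypothetical helper name used only for this sketch.
highlight_dups() {
    printf '%s\n' "$1" | sed -E '
        # Label that the loop jumps back to.
        :a
        # Find a word (group 1) that occurs again later on the line, preceded
        # by a space and followed by a space or end of line, and wrap that
        # later occurrence in asterisks. Group 4 restores the separator.
        s/\b([a-zA-Z]+)\b(.*) (\1)( |$)/\1\2 *\3*\4/
        # If the substitution changed something, branch back to :a and retry.
        ta
    '
}

highlight_dups "lol foo lol bar foo bar"
```

Each pass marks one repeated word, and the loop stops on the first pass where the s command finds nothing left to change.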
UPDATE: An improvement provided by potong in the comments:
sed -r ':a;s/\b(([[:alpha:]]+)\s.*\s)\2\b/\1*\2*/;ta' file
Using awk
awk '{for (i=1;i<=NF;i++) if (a[$i]++>=1) printf "*%s* ",$i; else printf "%s ",$i; print ""}' file
lol foo *lol* bar *foo* *bar*
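For readability, the same awk approach can be sketched in a longer form. This is only an illustration: the function name highlight_dups_awk and the variable names are placeholders, and it differs from the one-liner in one small way, namely that it builds each output line first so that no trailing space is printed. As in the one-liner, the counter array is never reset, so a word also counts as a duplicate if it appeared on an earlier line.

```shell
#!/bin/sh
# highlight_dups_awk is a hypothetical helper name used only for this sketch.
# It reads text on standard input and writes the highlighted text to standard output.
highlight_dups_awk() {
    awk '{
        line = ""
        for (i = 1; i <= NF; i++) {
            # seen[] persists across input lines, so a word is treated as a
            # duplicate if it appeared anywhere earlier in the input.
            if (seen[$i]++)
                word = "*" $i "*"
            else
                word = $i
            # Join the words with single spaces, without a trailing one.
            if (i > 1)
                line = line " "
            line = line word
        }
        print line
    }'
}

printf '%s\n' "lol foo lol bar foo bar" | highlight_dups_awk
```

Because awk splits on runs of whitespace, the original spacing between words is normalized to single spaces in the output.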