
Remove duplicate words in a line with sed

Tags:

sed

Purely academic, but it's frustrating me.

I want to correct this text:

there there are are multiple lexical errors in this line line

using sed. I've got this far:

sed 's/\([a-z][a-z]*[ ,\n][ ,\n]*\)\1/\1/g' < file.text

It corrects everything except the final doubled up words!

there are multiple lexical errors in this line line

Can a sed guru please explain why the above doesn't deal with the words at the end?

benjwy asked May 15 '12 11:05



1 Answer

This is because, in the last case (line line), the capture group \1 holds "line " (line followed by a space), and the regex then searches for a repetition of exactly that. Since there is no space after the final line, the match fails.

To fix this, add a space after the final word, line.
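A quick way to check this explanation is to run the question's expression on the sample text with and without a trailing space. This is only a sketch and assumes GNU sed (where \n inside a bracket expression means a newline); the variable names are made up for illustration.

```shell
# The expression from the question, unchanged (GNU sed assumed).
expr='s/\([a-z][a-z]*[ ,\n][ ,\n]*\)\1/\1/g'
text='there there are are multiple lexical errors in this line line'

# No trailing space: \1 holds "line " but only "line" follows, so the
# last pair is left alone.
without_space=$(printf '%s\n' "$text" | sed "$expr")

# Trailing space appended: "line " is now followed by "line ", so it collapses.
with_space=$(printf '%s \n' "$text" | sed "$expr")

# Brackets make the trailing space in the second result visible.
printf '[%s]\n' "$without_space" "$with_space"
```

The second result still ends with the space that was appended, so this is more of a diagnostic than a clean fix.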

Alternatively you can change the regex to:

sed -e 's/\b\([a-z]\+\)[ ,\n]\1/\1/g'
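Here only the word itself is captured, and the separator sits outside the group, so the last pair no longer needs a trailing space. The sketch below runs that expression on the sample text, again assuming GNU sed (\b and \+ are GNU extensions). It also shows a hypothetical variant, not part of the answer above, that adds a closing \b so that a word followed by a longer word starting with the same letters (such as "the then") is left untouched.

```shell
text='there there are are multiple lexical errors in this line line'

# The alternative expression: \1 is just the word, without the separator.
fixed=$(printf '%s\n' "$text" | sed -e 's/\b\([a-z]\+\)[ ,\n]\1/\1/g')

# Without a closing boundary, "the then" is treated as "the" + "the" + "n".
loose=$(printf '%s\n' 'the then' | sed -e 's/\b\([a-z]\+\)[ ,\n]\1/\1/g')

# Hypothetical variant (an assumption, not from the answer): a trailing \b
# requires the repeated text to end at a word boundary.
strict=$(printf '%s\n' 'the then' | sed -e 's/\b\([a-z]\+\)[ ,\n]\1\b/\1/g')

printf '%s\n' "$fixed" "$loose" "$strict"
```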


codaddict answered Sep 23 '22 02:09
