I have a large set (600-odd) of search-and-replace terms that I need to run as a sed script over some files. The problem is that the search terms are NOT orthogonal... but I think I can get away with it by sorting by line length (i.e. pulling out the longest matches first), and then alphabetically within each length. So given an unsorted set of:
aaba
aa
ab
abba
bab
aba
what I want is a sorted set such as:
abba
aaba
bab
aba
ab
aa
Is there a way of doing it by, say, prepending the line length and sorting by a field?
For bonus marks :-) !!! The search and replace is actually simply a case of replacing term with _term_, and the sed code I was going to use was s/term/_term_/g. How would I write the regex to avoid replacing terms that are already within _ pairs?
You can do this in a one-line Perl script:
perl -e 'print sort { length $b<=>length $a || $b cmp $a } <>' input
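The "prepend the length and sort by a field" idea from the question can also be done with standard tools. This is only a sketch: the function name sort_by_length is made up for illustration, and it assumes one term per line.

```shell
# Hypothetical helper (name invented for this example):
# sort lines longest-first, then in reverse alphabetical order within each length.
sort_by_length() {
  # Decorate: prepend each line's length and a tab.
  awk '{ print length($0) "\t" $0 }' |
    # Sort on field 1 numerically descending, then on the rest of the line descending.
    LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2r |
    # Undecorate: drop the length field again.
    cut -f2-
}

# The sample terms from the question.
sorted=$(printf '%s\n' aaba aa ab abba bab aba | sort_by_length)
printf '%s\n' "$sorted"
```

LC_ALL=C is there only to make the ordering byte-wise and independent of the current locale; drop it if you want locale-aware collation.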
You could compact it all into one regexp:
$ sed -e 's/\(aaba\|aa\|abba\)/_\1_/g'
testing words aa, aaba, abba.
testing words _aa_, _aaba_, _abba_.
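If I understand your question correctly, this solves both of your problems. There is no "double replacement", because sed does not rescan text it has already substituted within a single s///g pass, and the longest word always matches, because POSIX regex alternation picks the leftmost-longest match regardless of the order of the alternatives. Note that \| is a GNU sed extension.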
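With 600-odd terms you would not want to type the alternation by hand. Here is a rough sketch of building it from a term list. The variable names are placeholders, the list is inlined instead of read from a file, and it assumes that no term contains regex metacharacters or a slash. It uses sed -E (extended regex) so the alternation is written with a plain |.

```shell
# Hypothetical term list; in practice this would come from your terms file.
terms=$(printf '%s\n' aaba aa ab abba bab aba)

# Join the terms with "|" into a single alternation: aaba|aa|ab|abba|bab|aba
pattern=$(printf '%s\n' "$terms" | paste -sd '|' -)

# One substitution in one pass: wrap every match in underscores.
result=$(printf '%s\n' 'testing words aa, aaba, abba.' |
  sed -E "s/($pattern)/_\1_/g")
printf '%s\n' "$result"
```

Because everything happens in one s///g pass, the order of the terms in the list should not matter, so the length sort becomes unnecessary for this approach.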