
What does this sed expression from todo.sh do?

Tags: regex, shell, sed

What does the sed expression: G; s/\n/&&/; /^\([ ~-]*\n\).*\n\1/d; s/\n//; h; P do? Exactly what does it match and how does it match it?

It's from todo.sh. In context:

archive()
{
    #defragment blank lines
    sed -i.bak -e '/./!d' "$TODO_FILE"                     ## delete all empty lines
    [ $TODOTXT_VERBOSE -gt 0 ] && grep "^x " "$TODO_FILE"  ## if verbose mode print completed tasks..
    grep "^x " "$TODO_FILE" >> "$DONE_FILE"                ## append completed tasks to $DONE_FILE
    sed -i.bak '/^x /d' "$TODO_FILE"                       ## delete completed tasks
    cp "$TODO_FILE" "$TMP_FILE"


    sed -n 'G; s/\n/&&/; /^\([ ~-]*\n\).*\n\1/d; s/\n//; h; P' "$TMP_FILE" > "$TODO_FILE"


    ## G;                       Add a newline
    ## s/\n/&&/;                Substitute newline with && (two newlines?)
    ## /^\([ ~-]*\n\).*\n\1/d;  Delete duplicate lines???
    ## s/\n//                   Remove newlines
    ## h                        Hold: copy pattern space to buffer
    ## P                        Print first line of pattern space
    if [ $TODOTXT_VERBOSE -gt 0 ]; then
        echo "TODO: $TODO_FILE archived."
    fi
}
Leftium asked Dec 27 '22 21:12



1 Answer

Ok, you've got some of the story already. Recall that the sed script is executed once for each input line. So the G at the beginning appends a newline followed by the contents of the hold space to the current line. The hold space is empty initially, and the h command at the end of the script grows it on every cycle that gets that far.
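To see G in isolation, you can run it with nothing ever stored in the hold space. This is a minimal sketch with made-up input, assuming a POSIX-like sed:

```shell
# G appends "\n" + hold space to the pattern space. Nothing ever writes
# to the hold space here, so each line just gains a trailing newline,
# which shows up as a blank line in the output.
doubled=$(printf 'a\nb\n' | sed 'G')
printf '%s\n' "$doubled"
```

The input comes out double-spaced (the final blank line is stripped by the command substitution).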

Then s/\n/&&/ duplicates the first newline only, the one between the current line and what was grabbed from the hold space. This is in preparation for the next command. /^\([ -~]*\n\).*\n\1/ indeed matches if the current line is identical to a line in the hold space:
    ^\([ -~]*\n\) matches a line at the beginning of the buffer¹
        Note that this matches only if the line contains only printable ASCII characters.
        If your system supports locales, ^\([[:print:]]*\n\) would be better.
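    .*\n matches everything up to some later newline, i.e. at least one subsequent line (possibly just the empty line created by the doubled newline)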
    \1 matches a line identical to the first line
The extra newline added by the previous s command takes care of the case when the duplicate is the very first line from the hold space. The point of the \n\1 is to “anchor” the duplicate at the beginning of a line, otherwise bar would be considered a duplicate of foobar. If the current line is a duplicate, the d command discards the pattern space and sed starts the next cycle with the next input line, skipping the rest of the script.
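If you want to watch the match happen, you can insert sed's l command just before the test. This is only a sketch: the three-line input is made up, and LC_ALL=C is my assumption to keep the [ -~] range meaning "space through tilde" regardless of locale.

```shell
# Sample input (made up): the third line repeats the first.
# `l` prints the pattern space unambiguously: "\n" is an embedded
# newline and "$" marks the end of the pattern space.
trace=$(printf 'a\nb\na\n' |
  LC_ALL=C sed -n 'G; s/\n/&&/; l; /^\([ -~]*\n\).*\n\1/d; s/\n//; h')
printf '%s\n' "$trace"
```

With a POSIX-conforming l, the three cycles should show up as a\n\n$, then b\n\na\n$, then a\n\nb\na\n$. Only the last one contains the first line a second time at the start of a later line, so only that cycle hits d.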

If the current line is not a duplicate, s/\n// discards that extra newline (again, no g modifier, so only the first newline is removed). Then the h command results in the hold space containing what it contained before, with the current line prepended. Finally P prints the pattern space up to the first newline, which is the current input line. Since sed runs with -n, that is the only output for the cycle.

Ok, now what does the hold space contain? It starts empty, then gets each successive line prepended unless it's a duplicate. So the hold space contains the input lines, in reverse order, minus the duplicates.
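You can check this by dumping the pattern space on the last line instead of printing each line as it goes. A sketch with made-up input, again assuming LC_ALL=C for the character range:

```shell
# Same script without P; `$p` prints the whole pattern space on the last
# line instead. Right after h the pattern space and the hold space are
# identical, so this shows what the hold space ends up containing.
# Caveat: if the last input line were a duplicate, d would end the cycle
# before `$p` runs and nothing would be printed.
held=$(printf 'a\nb\na\nc\n' |
  LC_ALL=C sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; $p')
printf '%s\n' "$held"
```

For this input the result is c, b, a: the unique lines, newest first.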

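¹ Uh, I don't know how you did that, but that should be [ -~] (the range from space to tilde, i.e. every printable ASCII character), not [ ~-] (just space, tilde and hyphen), which wouldn't make any sense here.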
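To make the difference between the two bracket expressions concrete, here is a small comparison. The input is made up, and LC_ALL=C is my assumption so that the range is interpreted by ASCII code:

```shell
# Made-up input: "pay rent" appears twice, "--" appears twice.
input='pay rent
--
pay rent
--'

# Bracket expression as written in the question: space, tilde, hyphen only.
as_asked=$(printf '%s\n' "$input" |
  LC_ALL=C sed -n 'G; s/\n/&&/; /^\([ ~-]*\n\).*\n\1/d; s/\n//; h; P')

# Range from space to tilde: every printable ASCII character.
corrected=$(printf '%s\n' "$input" |
  LC_ALL=C sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P')

printf 'with [ ~-]:\n%s\n\nwith [ -~]:\n%s\n' "$as_asked" "$corrected"
```

With [ ~-] only the second "--" is dropped, because "pay rent" contains characters outside the class and so never matches the first group. With [ -~] both duplicates are dropped.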


Here's another way of doing this, if you have a POSIX-conforming set of tools (Single Unix v2 is good enough).

<"$TMP_FILE" \
nl -s: |              # add line numbers
sort -t: -k2 -u |     # sort, ignoring the line numbers, and remove duplicates
sort -t: -k1 -n |     # sort by line number
cut -d: -f2-          # cut out the line numbers
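Here is the same pipeline on a made-up sample, with the first stage shown separately so you can see what the sort steps operate on. One caveat, as far as I can tell: POSIX only says that sort -u keeps one line from each set with equal keys, not which one, so the position where a duplicated line ends up relies on the sort implementation keeping the first occurrence (GNU sort documents that it does).

```shell
# Made-up sample; the real pipeline reads "$TMP_FILE" instead.
sample='pay rent
buy milk
pay rent
call bob'

# Stage 1 only: every line gets a "number:" prefix.
printf '%s\n' "$sample" | nl -s:

# Full pipeline: number, dedupe on the text, restore order, strip numbers.
deduped=$(printf '%s\n' "$sample" |
  nl -s: | sort -t: -k2 -u | sort -t: -k1 -n | cut -d: -f2-)
printf '%s\n' "$deduped"
```

Three lines come out, one for each distinct task.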

Oh, you wanted to do this legibly and concisely? Just use awk.

<"$TMP_FILE" awk '!seen[$0] {++seen[$0]; print}'

If the current line hasn't been seen yet, mark it as seen, and print it.
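A quick side-by-side on made-up input, as a sanity check that the awk one-liner and the sed script (with the corrected [ -~] range, and LC_ALL=C as my assumption) agree:

```shell
# Made-up input with two duplicated lines.
input='b
a
b
c
a'

via_awk=$(printf '%s\n' "$input" | awk '!seen[$0] {++seen[$0]; print}')
via_sed=$(printf '%s\n' "$input" |
  LC_ALL=C sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P')

printf 'awk:\n%s\n\nsed:\n%s\n' "$via_awk" "$via_sed"
```

Both print b, a, c: the first occurrence of each line, in input order.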

Note that like the sed method, the awk method essentially stores the whole file in memory. The method above using sort has the advantage that only sort needs to keep more than one line of input at a time, and it's designed for this.

Of course, if you don't care about the order of the lines, it's as simple as sort -u.

Gilles 'SO- stop being evil' answered Jan 11 '23 15:01
