What does the sed expression: G; s/\n/&&/; /^\([ ~-]*\n\).*\n\1/d; s/\n//; h; P
do? Exactly what does it match and how does it match it?
It's from todo.sh. In context:
archive()
{
#defragment blank lines
sed -i.bak -e '/./!d' "$TODO_FILE" ## delete all empty lines
[ $TODOTXT_VERBOSE -gt 0 ] && grep "^x " "$TODO_FILE" ## if verbose mode print completed tasks..
grep "^x " "$TODO_FILE" >> "$DONE_FILE" ## append completed tasks to $DONE_FILE
sed -i.bak '/^x /d' "$TODO_FILE" ## delete completed tasks
cp "$TODO_FILE" "$TMP_FILE"
sed -n 'G; s/\n/&&/; /^\([ ~-]*\n\).*\n\1/d; s/\n//; h; P' "$TMP_FILE" > "$TODO_FILE"
## G; Add a newline
## s/\n/&&/; Substitute newline with && (two newlines?)
## /^\([ ~-]*\n\).*\n\1/d; Delete duplicate lines???
## s/\n// Remove newlines
## h Hold: copy pattern space to buffer
## P Print first line of pattern space
if [ $TODOTXT_VERBOSE -gt 0 ]; then
echo "TODO: $TODO_FILE archived."
fi
}
Ok, you've got some of the story already. Recall that the sed expression is executed for each input line. So the G
at the beginning appends the contents of the hold space to the current line (with a newline in between). The contents of the hold space is empty initially but expanded by the h
command at the end of each input cycle.
Then s/\n/&&/
duplicates the first newline only, the one between the current line and what was grabbed from the hold space. This is in preparation for the next command. /^\([ -~]*\n\).*\n\1/
indeed matches if the current line is identical to a line in the hold space:
^\([ -~]*\n\)
matches a line at the beginning of the buffer¹
Note that this matches only if the line contains only printable ASCII characters.
If your system supports locales, ^\([[:print:]]*\n\)
would be better.
.*\n
matches at least one subsequent line
\1
matches a line identical to the first line
The extra newline added by the previous s
command takes care of the case when the duplicate is the very first line from the hold space. The point of the \n\1
is to “anchor” the duplicate at the beginning of a line, otherwise bar
would be considered a duplicate of foobar
. If the current line is a duplicate, the d
command discards it and execution branches to the next line.
If the current line is not a duplicate, s/\n//
discards that extra newline (again, no g
modifier, so only the first newline is removed). Then the h
command results in the hold space containing what it contained before, with the current line prepended. Finally P
prints the current input line.
Ok, now what does the hold space contain? It starts empty, then gets each successive line prepended unless it's a duplicate. So the hold space contains the input lines, in reverse order, minus the duplicates.
¹ Uh, I don't know how you did that, but that should be [ -~]
, not [ ~-]
which wouldn't make any sense.
Here's another way of doing this, if you have a POSIX-conforming set of tools (Single Unix v2 is good enough).
<"$TMP_FILE" \
nl -s: | # add line numbers
sort -t: -k2 -u | # sort, ignoring the line numbers, and remove duplicates
sort -t: -k1 -n | # sort by line number
cut -d: -f2- # cut out the line numbers
Oh, you wanted to do this legibly and concisely? Just use awk.
<"$TMP_FILE" awk '!seen[$0] {++seen[$0]; print}'
If the current line hasn't been seen yet, mark it as seen, and print it.
Note that like the sed method, the awk method essentially stores the whole file in memory. The method above using sort
has the advantage that only sort
needs to keep more than one line of input at a time, and it's designed for this.
Of course, if you don't care about the order of the lines, it's as simple as sort -u
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With