Remove duplicates from tar archive

Tags:

bash

archive

tar

I'm trying to create an archive of multiple text files. These files are sometimes updated, and when that happens I use tar's --update option to append the new versions to the archive.

Say we have two files, test1.txt, and test2.txt. These files are added to archive test.tar.

Inspecting the tar with tar -tf test.tar

I get as expected:

test1.txt
test2.txt

Now I update test2.txt and append it to the archive using tar -f test.tar -u test2.txt.

I expect the output of running tar -tf test.tar to be:

test1.txt
test2.txt

But instead I get:

test1.txt
test2.txt
test2.txt

So how do I shake this tar to remove the older test2.txt? I know that after extracting the archive, I'd get only the most recent changes to both files, so this problem might seem trivial in this demo, but I'm actually archiving thousands of 5000-line files so the archive sizes get comically large with repeated runs.

Currently, I extract the files into a temp directory and re-archive them each time my script runs. This is obviously very inefficient. I'm hoping there's a tar option I'm missing somewhere.

Osama Adam asked Sep 18 '25 06:09


1 Answer

A TAR archive is simply a concatenation of the raw file contents with metadata headers in between. As you noticed, updating a file appends the new version to the end of the archive and, by convention, the last occurrence of a file in the archive "wins". TAR does not overwrite the old version in place because everything stored after it might have to be shifted to make room for a larger new version.
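To see the "last one wins" convention in action, here is a minimal sketch. It assumes GNU tar and a scratch directory from mktemp; the names note.txt and demo.tar are made up for illustration. It uses --append (-r), which always adds the file, so no timestamp handling is needed:

```shell
# Scratch directory so nothing in the current folder is touched.
workdir=$(mktemp -d)

echo old > "$workdir/note.txt"
tar -cf "$workdir/demo.tar" -C "$workdir" note.txt

# Append a second version of the same member name.
echo new > "$workdir/note.txt"
tar -rf "$workdir/demo.tar" -C "$workdir" note.txt

# The archive now lists the name twice ...
member_count=$(tar -tf "$workdir/demo.tar" | grep -c '^note\.txt$')

# ... but extraction writes the members in order, so the later copy
# overwrites the earlier one.
mkdir "$workdir/out"
tar -xf "$workdir/demo.tar" -C "$workdir/out"
extracted=$(cat "$workdir/out/note.txt")

echo "members named note.txt: $member_count, extracted content: $extracted"
rm -rf "$workdir"
```

Both copies still occupy space in the archive, which is why repeated updates make it grow.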

There actually is a TAR option that has not been mentioned here yet and that fits your use case: --occurrence=[NUMBER]. With this option, you can specify which of the multiple versions of a file with the same name/path is to be extracted or deleted. It would work fine with your simple example. This is how I set it up:

echo foo > test1.txt
echo foo > test2.txt
tar -cf updated.tar test1.txt test2.txt
sleep 1s
echo barbara > test2.txt
tar --update -f updated.tar test1.txt test2.txt
sleep 1s
echo foobar > test2.txt
tar --update -f updated.tar test1.txt test2.txt
tar tvlf updated.tar
    -rwx------ user/group   4 2022-03-29 19:00 test1.txt
    -rwx------ user/group   4 2022-03-29 19:00 test2.txt
    -rwx------ user/group   8 2022-03-29 19:01 test2.txt
    -rwx------ user/group   7 2022-03-29 19:01 test2.txt

Note that tar --update only compares timestamps, not contents, and the timestamp only has 1 s granularity! Therefore, we need to wait one second between modifications so that the new timestamp is at least one second later; otherwise, tar will not add the file to the archive. This is especially important when copy-pasting this code.
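The timestamp behaviour can be checked without sleeping by pinning the modification times explicitly. The following is only a sketch: it assumes GNU tar with its default archive format and GNU touch (for the -d date syntax), and the names f.txt and a.tar are hypothetical:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo foo > f.txt
touch -d '2022-03-29 19:00:00' f.txt
tar -cf a.tar f.txt

# Change the contents but restore the old mtime:
# --update finds nothing newer and should skip the file.
echo changed > f.txt
touch -d '2022-03-29 19:00:00' f.txt
tar --update -f a.tar f.txt
count_same=$(tar -tf a.tar | grep -c '^f\.txt$')

# Move the mtime forward by one second: now the file should be appended.
touch -d '2022-03-29 19:00:01' f.txt
tar --update -f a.tar f.txt
count_newer=$(tar -tf a.tar | grep -c '^f\.txt$')

echo "entries after same-mtime update: $count_same"
echo "entries after newer-mtime update: $count_newer"

cd / && rm -rf "$workdir"
```

If the first count stays at 1 even though the contents changed, the archive is silently missing the edit, which is the pitfall described above.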

Simply calling --delete will delete all versions:

tar --delete -f updated.tar test2.txt
tar tvlf updated.tar
    -rwx------ user/group   4 2022-03-29 19:00 test1.txt

When specifying --occurrence=1, only the first occurrence, i.e., the oldest version, will be deleted (starting again from the archive containing all three versions):

tar --delete --occurrence=1 -f updated.tar test2.txt
tar tvlf updated.tar
    -rwx------ user/group   4 2022-03-29 19:00 test1.txt
    -rwx------ user/group   8 2022-03-29 19:01 test2.txt
    -rwx------ user/group   7 2022-03-29 19:01 test2.txt

Unfortunately, --delete with --occurrence removes exactly one version of a file per call. So, you would have to keep deleting the oldest version until only the most recent one is left. It is possible to do this in bash, and that would at least be more space-efficient than extracting to a temporary folder, but it would probably be slower because it has to go over the archive many times, and each time the archive is basically completely rewritten in place.
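To gauge how many such passes an archive would need, you can count the duplicates first without modifying anything. This is a rough sketch assuming GNU tar and coreutils, with made-up file names; since each pass removes one version per duplicated name, the number of passes is the highest version count minus one:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a demo archive: a.txt once, b.txt three times (via --append).
echo a > a.txt
echo b1 > b.txt
tar -cf demo.tar a.txt b.txt
echo b2 > b.txt
tar -rf demo.tar b.txt
echo b3 > b.txt
tar -rf demo.tar b.txt

# Names that occur more than once in the listing.
duplicated=$(tar -tf demo.tar | sort | uniq -d)

# Highest number of versions stored for any single name.
max_versions=$(tar -tf demo.tar | sort | uniq -c | sort -rn | head -n 1 | awk '{print $1}')
passes=$(( max_versions - 1 ))

echo "duplicated names: $duplicated"
echo "delete passes needed: $passes"

cd / && rm -rf "$workdir"
```

This read-only check is cheap compared to a delete pass, so it can also serve as a guard that skips the cleanup entirely when nothing is duplicated.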

I recommend using ratarmount, which I wrote, instead. It will mount the archive (without actually extracting it) and expose a folder view showing the most recent versions of each file. Using this, you can create the new trimmed-down archive:

python3 -m pip install --user ratarmount
ratarmount updated.tar
ls -lA updated/
    -rwx------ 1 user group 4 Mar 29 19:14 test1.txt
    -rwx------ 1 user group 7 Mar 29 19:14 test2.txt
tar -c -f most-recent.tar -C updated/ .
tar tvlf most-recent.tar
    drwxrwxrwx user/group   0 2022-03-29 19:00 ./
    -rwx------ user/group   4 2022-03-29 19:00 ./test1.txt
    -rwx------ user/group   7 2022-03-29 19:01 ./test2.txt

And there you have it. The listing looks a bit different, with a leading ./ on each entry, because we used -C and told tar to archive the . folder. Normally, this doesn't hurt, but you can avoid it with any of these slightly more problematic alternatives:

tar -c -f most-recent.tar -C updated/ test1.txt test2.txt
tar -c -f most-recent.tar -C updated/ $( cd updated && find . -mindepth 1 -maxdepth 1 -printf '%P\n' )
( cd updated/ && tar -c -f ../most-recent.tar {[^.],.[!.],..?}*; )

If you encounter problems with ratarmount, please open an issue in its issue tracker. Note that ratarmount even exposes the older versions, but in well-hidden special folders:

ratarmount updated.tar
ls -lA updated/test2.txt.versions/
    -rwx------ 1 user group 4 Mar 29 20:10 1
    -rwx------ 1 user group 8 Mar 29 20:10 2
    -rwx------ 1 user group 7 Mar 29 20:10 3

The file names inside the special .versions folder match the arguments given to --occurrence.


The bash version mentioned above, using --occurrence, would look like this:

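# Repeatedly delete the oldest version of every duplicated member until
# each name occurs only once. Needs GNU tar and an uncompressed archive.
function deleteAllButMostRecentInTar()
{
    local archive=$1
    local filesToDelete
    filesToDelete=$( mktemp )

    while true; do
        # Collect the names that occur more than once: drop lines with
        # count 1, then strip the count prefix added by uniq -c.
        tar --list --file "$archive" | sort | uniq -c |
            sed -n -E '/^[ \t]*1 /d; s|^[ \t]*[0-9]+ ||p' > "$filesToDelete"
        if [[ -s "$filesToDelete" ]]; then
            local fileCount
            fileCount=$( wc -l < "$filesToDelete" )
            echo -n "Found $fileCount files with more than one version. Deleting ..."
            # --occurrence=1 removes only the first (oldest) copy of each name.
            tar --delete --occurrence=1 --files-from="$filesToDelete" \
                --file "$archive"
            echo " OK"
        else
            break
        fi
    done
    rm -- "$filesToDelete"
    echo
}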

deleteAllButMostRecentInTar updated.tar
tar tvlf updated.tar
    -rwx------ user/group   4 2022-03-29 19:00 test1.txt
    -rwx------ user/group   7 2022-03-29 19:01 test2.txt
mxmlnkn answered Sep 20 '25 23:09