
Tar archive preserving hardlinks

Tags:

go

Using the archive/tar package in Go, it doesn't seem possible to access the number of hardlinks a file has. However, I remember reading somewhere that tar'ing a directory or file can preserve the hardlinks.

Is there some package in Go that can help me do this?

steve asked Feb 06 '23 14:02



1 Answer

tar does preserve the hardlinks.

Here's a sample directory with three hard-linked files and one file with a single link:

foo% vdir .
total 16
-rw-r--r-- 3 kostix kostix 5 Jul 12 19:37 bar.txt
-rw-r--r-- 3 kostix kostix 5 Jul 12 19:37 foo.txt
-rw-r--r-- 3 kostix kostix 5 Jul 12 19:37 test.txt
-rw-r--r-- 1 kostix kostix 9 Jul 12 19:49 xyzzy.txt

Now we archive it using GNU tar and verify that it indeed added the links (because we didn't pass it the --hard-dereference command-line option):

foo% tar -cf ../foo.tar .
foo% tar -tvf ../foo.tar
drwxr-xr-x kostix/kostix     0 2016-07-12 19:49 ./
-rw-r--r-- kostix/kostix     9 2016-07-12 19:49 ./xyzzy.txt
-rw-r--r-- kostix/kostix     5 2016-07-12 19:37 ./bar.txt
hrw-r--r-- kostix/kostix     0 2016-07-12 19:37 ./test.txt link to ./bar.txt
hrw-r--r-- kostix/kostix     0 2016-07-12 19:37 ./foo.txt link to ./bar.txt

The documentation of archive/tar refers to a number of documents defining the tar archive format (unfortunately, there is no single standard: for instance, GNU tar does not support POSIX extended attributes, while BSD tar (which relies on libarchive) does, and so does pax). To cite its bit on hardlinks:

LNKTYPE

This flag represents a file linked to another file, of any type, previously archived. Such files are identified in Unix by each file having the same device and inode number. The linked-to name is specified in the linkname field with a trailing null.

So, a hardlink is an entry of a special type ('1') which refers to some preceding (already archived) file by its name.

So let's create a playground example.

We base64-encode our archive:

foo% base64 <../foo.tar | xclip -selection clipboard

…and write the code. The archive contains a single directory (type '5'), one file (type '0'), then another file (type '0') followed by two hardlinks (type '1') to it.

The output from the playground example:

Archive entry '5': ./
Archive entry '0': ./xyzzy.txt
Archive entry '0': ./bar.txt
Archive entry '1': ./test.txt link to ./bar.txt
Archive entry '1': ./foo.txt link to ./bar.txt

So your link-counting code should:

  1. Scan the entire archive record-by-record.

  2. Remember any regular file (type archive/tar.TypeReg or type archive/tar.TypeRegA) already processed, and have a counter associated with it, which starts at 1.

    In reality, it is better to work by exclusion and record entries of every type except symbolic links and directories, because tar archives can also contain nodes for character and block devices, and FIFOs (named pipes).

  3. When you encounter a hard link (type archive/tar.TypeLink),

    1. Read the Linkname field of its header.
    2. Look up the entry matching that name in your list of "seen" files and increment its counter.

Update

As the OP actually wanted to know how to manage hardlinks on the source filesystem, here's the update.

The chief idea is that on a filesystem with POSIX semantics:

  • A directory entry designating a file actually points to a special filesystem metadata block called "inode". The inode contains the number of directory entries pointing to it.

    Creating a hardlink is actually just:

    1. Creating a new directory entry pointing to the inode of the original (source) file ("the link target" in ln's terms).
    2. Incrementing the link counter in that inode.
  • Hence any file is uniquely identified by two integer numbers: the "device number" identifying the physical device hosting the filesystem on which the file is located, and the inode number identifying the file's data.

    It follows that if two files have the same (device, inode) pairs, they represent the same content. Or, if we put it differently, one is a hardlink to the other.

So, adding files to a tar archive while preserving the hardlinks works this way:

  1. Having added a file, save its (device, inode) pair to some lookup table.

  2. When adding another file, figure out its (device, inode) pair and look it up in that table.

    If a matching entry is found, the file's data was already streamed, and we should add a hardlink.

    Otherwise, behave as in step (1).

So here's the code:

package main

import (
    "archive/tar"
    "io"
    "log"
    "os"
    "path/filepath"
    "syscall"
)

type devino struct {
    Dev uint64
    Ino uint64
}

func main() {
    log.SetFlags(0)

    if len(os.Args) != 2 {
        log.Fatalf("Usage: %s DIR\n", os.Args[0])
    }

    seen := make(map[devino]string)

    tw := tar.NewWriter(os.Stdout)

    err := filepath.Walk(os.Args[1],
        func(fn string, fi os.FileInfo, we error) (err error) {
            if we != nil {
                log.Fatalf("Error processing %s: %v", fn, we)
            }

            hdr, err := tar.FileInfoHeader(fi, "")
            if err != nil {
                return
            }

            if fi.IsDir() {
                err = tw.WriteHeader(hdr)
                return
            }

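            // POSIX only: elsewhere Sys() does not return *syscall.Stat_t.
            st := fi.Sys().(*syscall.Stat_t)
            di := devino{
                // The field widths differ between platforms, hence the conversions.
                Dev: uint64(st.Dev),
                Ino: uint64(st.Ino),
            }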

            orig, ok := seen[di]
            if ok {
                hdr.Typeflag = tar.TypeLink
                hdr.Linkname = orig
                hdr.Size = 0

                err = tw.WriteHeader(hdr)
                return
            }

            fd, err := os.Open(fn)
            if err != nil {
                return
            }
            // Close on every return path; ignoring the error for a file opened R/O.
            defer fd.Close()
            err = tw.WriteHeader(hdr)
            if err != nil {
                return
            }
            _, err = io.Copy(tw, fd)
            if err == nil {
                // Remember the name so later links to this inode can refer to it.
                seen[di] = fi.Name()
            }
            return err
        })

    if err != nil {
        log.Fatal(err)
    }

    err = tw.Close()
    if err != nil {
        log.Fatal(err)
    }

    return
}

Note that it's quite inadequate:

  • It improperly deals with file and directory names.

  • It makes no attempt to handle symlinks and FIFOs properly, or to skip Unix-domain sockets and the like.

  • It assumes it works in a POSIX environment.

    On non-POSIX systems, the Sys() method called on a value of type os.FileInfo might return something other than the POSIX'y syscall.Stat_t.

    Say, on Windows, there are multiple filesystems hosted by different "disks" or "drives". I have no idea how Go handles that. Maybe the "device number" would have to be emulated somehow for this case.

On the other hand, it shows how to handle hardlinks:

  • Set the "Typeflag" field of the header struct to tar.TypeLink.
  • Set the "Linkname" field to the name under which the file was first archived.
  • Reset the "Size" field of the header to 0 (because no data will follow).

You might also want to use another approach to maintain the lookup table: if most of your files are expected to be located on the same physical filesystem, every entry wastes a uint64 on a device number that is the same for all of them. So a hierarchy of maps might be a sensible thing to do: the first map takes device numbers to a second map, which takes inode numbers to file names.

kostix answered Feb 09 '23 04:02
