Sparse files: How to find contents

If I create a file, use lseek(2) to jump to a high offset in the (still empty) file, and then write some valuable information there, I create a sparse file on a Unix system. (This probably depends on the file system; let's assume a typical Unix file system such as ext4, where this is the case.)

If I then lseek(2) to an even higher offset and write something there as well, I end up with a sparse file that contains the valuable information somewhere in its middle, surrounded by huge holes. I'd like to find this valuable information without having to read the whole file.

Example:

$ python
f = open('sparse', 'wb')
f.seek((1 << 40) + 42)   # 1 TiB + 42 bytes
f.write(b'foo')
f.seek((1 << 40) * 2)    # 2 TiB
f.write(b'\0')
f.close()

This creates a 2 TiB file that uses only 8 KiB of disk space:

$ du -h sparse 
8.0K    sparse
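
The same comparison between apparent size and allocated space can be made from Python. This is a minimal sketch, assuming a POSIX system where st_blocks is reported in 512-byte units; the helper name sparse_usage is made up for illustration.

```python
import os

def sparse_usage(path):
    """Return (apparent_size, allocated_bytes) for path."""
    st = os.stat(path)
    # Assumption: st_blocks counts 512-byte units, as on Linux,
    # independent of the file system's own block size.
    return st.st_size, st.st_blocks * 512
```

If the second value is much smaller than the first, the file most likely contains holes.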

Somewhere in the middle of it (at 1TB + 42 bytes) is the valuable information (foo).

I can find it using cat sparse, of course, but that reads the complete file and prints immense amounts of zero bytes. I tried with smaller sizes and estimate that this method would take about 3 hours to print the three characters on my computer.

The question is:

Is there a way to find the information stored in a sparse file without reading all the empty blocks as well? Can I somehow find out where empty blocks are in a sparse file using standard Unix methods?

asked Nov 08 '22 17:11 by Alfe


1 Answer

Just writing an answer based on the previous comments:

#!/usr/bin/env python3
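from errno import ENXIO
# Take the constants from os (Python 3.3+) instead of hard-coding them:
# their numeric values differ between operating systems.
from os import SEEK_DATA, SEEK_HOLE, lseek
from sys import argv, stderr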

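def get_ranges(fobj):
    """Return a list of (start, end) offsets of the data extents in fobj."""
    fd = fobj.fileno()
    ranges = []
    end = 0

    while True:
        try:
            # Next offset >= end that holds data, then the hole that follows it.
            start = lseek(fd, end, SEEK_DATA)
            end = lseek(fd, start, SEEK_HOLE)
            ranges.append((start, end))
        except OSError as e:
            # ENXIO: there is no more data beyond the current offset.
            if e.errno == ENXIO:
                return ranges

            raise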

def main():
    if len(argv) < 2:
        print('Usage: %s <sparse_file>' % argv[0], file=stderr)
        raise SystemExit(1)

    try:
        with open(argv[1], 'rb') as f:
            ranges = get_ranges(f)
            for start, end in ranges:
                print('[%d:%d]' % (start, end))
                size = end-start
                length = min(20, size)
                f.seek(start)
                data = f.read(length)
                print(data)
    except OSError as e:
        print('Error:', e, file=stderr)
        raise SystemExit(1)

if __name__ == '__main__':
    main()

It probably doesn't do exactly what you want, however, which is to return only the data you wrote. The reported ranges typically follow file system block boundaries, so zero bytes may surround the data and must be trimmed by hand.

The current status of SEEK_DATA and SEEK_HOLE is described in https://man7.org/linux/man-pages/man2/lseek.2.html:
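
The trimming could be sketched roughly as follows. This is only an illustration, not part of the script above: the function name trimmed_extents is made up, it assumes a platform where os.SEEK_DATA, os.SEEK_HOLE and os.pread are available, and it assumes each data extent is small enough to read into memory in one call.

```python
import errno
import os

def trimmed_extents(path):
    """Yield (offset, data) per data extent, without surrounding zero bytes."""
    with open(path, 'rb') as f:
        fd = f.fileno()
        pos = 0
        while True:
            try:
                start = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:  # no more data after pos
                    return
                raise
            end = os.lseek(fd, start, os.SEEK_HOLE)
            # Assumption: the extent fits in memory and one pread returns
            # all of it; read in chunks for large extents.
            raw = os.pread(fd, end - start, start)
            body = raw.lstrip(b'\0')
            lead = len(raw) - len(body)  # zero bytes stripped at the front
            body = body.rstrip(b'\0')
            if body:
                yield start + lead, body
            pos = end
```

Note that this cannot tell padding from zero bytes that were written on purpose, so the single '\0' written at the 2 TiB offset in the question's example would not be reported.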

SEEK_DATA and SEEK_HOLE are nonstandard extensions also present in Solaris, FreeBSD, and DragonFly BSD; they are proposed for inclusion in the next POSIX revision (Issue 8).

answered Nov 15 '22 12:11 by hdante