If I create a file, use lseek(2) to jump to a high position in the (empty) file, and then write some valuable information there, I create a sparse file on a Unix system (this probably depends on the file system, but let's assume a typical Unix file system such as ext4, where this is the case).
If I then lseek(2) to an even higher position in the file and write something there as well, I end up with a sparse file that contains the valuable information somewhere in its middle, surrounded by huge holes. I'd like to find this valuable information within the file without having to read it completely.
Example:
$ python
f = open('sparse', 'w')
f.seek((1<<40) + 42)
f.write('foo')
f.seek((1<<40) * 2)
f.write('\0')
f.close()
This will create a 2TB file which uses only 8k of disk space:
$ du -h sparse
8.0K sparse
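The same comparison can be made from Python. This is only a minimal sketch: the helper name sizes is made up for illustration, and it assumes st_blocks is counted in 512-byte units, which is how the Python documentation describes it but which may not hold on every platform.

```python
import os


def sizes(path):
    """Return (apparent_size, allocated_bytes) for the file at path."""
    st = os.stat(path)
    # Assumption: st_blocks counts 512-byte units (true on Linux).
    return st.st_size, st.st_blocks * 512


# Usage: apparent, allocated = sizes('sparse')
# For a sparse file, allocated is expected to be far smaller than apparent.
```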
Somewhere in the middle of it (at 1TB + 42 bytes) is the valuable information (foo).
I can find it using cat sparse, of course, but that will read the complete file and print immense amounts of zero bytes. I tried with smaller sizes and, extrapolating from those, this method would take about 3h to print the three characters on my computer.
The question is:
Is there a way to find the information stored in a sparse file without reading all the empty blocks as well? Can I somehow find out where empty blocks are in a sparse file using standard Unix methods?
Just writing an answer based on the previous comments:
#!/usr/bin/env python3
from errno import ENXIO
from os import lseek
from sys import argv, stderr

SEEK_DATA = 3
SEEK_HOLE = 4


def get_ranges(fobj):
    ranges = []
    end = 0
    while True:
        try:
            start = lseek(fobj.fileno(), end, SEEK_DATA)
            end = lseek(fobj.fileno(), start, SEEK_HOLE)
            ranges.append((start, end))
        except OSError as e:
            if e.errno == ENXIO:
                return ranges
            raise


def main():
    if len(argv) < 2:
        print('Usage: %s <sparse_file>' % argv[0], file=stderr)
        raise SystemExit(1)
    try:
        with open(argv[1], 'rb') as f:
            ranges = get_ranges(f)
            for start, end in ranges:
                print('[%d:%d]' % (start, end))
                size = end-start
                length = min(20, size)
                f.seek(start)
                data = f.read(length)
                print(data)
    except OSError as e:
        print('Error:', e)
        raise SystemExit(1)


if __name__ == '__main__':
    main()
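However, it probably doesn't do exactly what you want, which is to return only the data you wrote. The file system typically tracks holes at block granularity, so each reported data range covers whole blocks; zero bytes may surround the returned data and must be trimmed by hand.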
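Trimming the zero padding could look roughly like the sketch below. The helper trim_zeros is hypothetical and not part of the script above; it assumes the payload itself neither starts nor ends with zero bytes, because padding and payload zeros cannot be told apart.

```python
def trim_zeros(block):
    """Strip leading and trailing zero bytes from a bytes object.

    Returns (lead, payload), where lead is the number of zero bytes
    removed from the front, so the payload's position in the file is
    the start of the data range plus lead.
    """
    stripped = block.lstrip(b'\0')
    lead = len(block) - len(stripped)
    return lead, stripped.rstrip(b'\0')
```

Note that a range holding only zero bytes, such as the single '\0' written at the 2TB mark in the question's example, comes back as an empty payload.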
The current status of SEEK_DATA and SEEK_HOLE is described in https://man7.org/linux/man-pages/man2/lseek.2.html:
SEEK_DATA and SEEK_HOLE are nonstandard extensions also present in Solaris, FreeBSD, and DragonFly BSD; they are proposed for inclusion in the next POSIX revision (Issue 8).
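Since these are extensions, the numeric values hard-coded in the script above are not guaranteed to match on every platform. Python 3.3 and later expose os.SEEK_DATA and os.SEEK_HOLE where the platform provides them, so a variant of get_ranges could rely on those instead. The following is a sketch under that assumption; the generator name data_ranges is made up for illustration.

```python
import errno
import os


def data_ranges(fd):
    """Yield (start, end) offsets of the data regions of an open file."""
    offset = 0
    while True:
        try:
            # Jump to the next region that contains data.
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as e:
            # ENXIO means there is no more data after offset.
            if e.errno == errno.ENXIO:
                return
            raise
        # The end of the file counts as an implicit hole.
        offset = os.lseek(fd, start, os.SEEK_HOLE)
        yield start, offset


# Usage:
# with open('sparse', 'rb') as f:
#     for start, end in data_ranges(f.fileno()):
#         print(start, end)
```

On a file system without hole support, the whole file is usually reported as one data range, so the result is still correct but no reading is saved.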