What is a sparse file and why do we need it? The only thing I have understood so far is that it is a very large file (gigabytes in size) and that it is efficient. How is it efficient?


2 Answers

Say you have a file that contains long runs of zero bytes (\x00). Storing them is wasteful: we know they are all zeros, so why spend space on the storage device for them? The filesystem can instead keep metadata that records where those runs are. Regions of a file that are described this way, with no blocks allocated to them, are called holes. When a process reads the file, the zero bytes in a hole are generated on the fly rather than read from physical storage (look at this schematic from Wikipedia):

Sparse File - Wikipedia

This is why a sparse file is efficient: it does not store the zeros on disk; instead it holds just enough metadata to describe where zeros have to be generated.

Note: for a sparse file the logical file size is greater than the physical size (the space allocated on disk), because the zeros in its holes are not stored on the storage device.
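To make the logical/physical distinction concrete, here is a minimal Python sketch. It assumes Linux (or another POSIX system where os.stat exposes st_blocks in 512-byte units); the helper name file_sizes and the 64 MiB size are only illustrative.

```python
import os
import tempfile

def file_sizes(path):
    """Return (logical_size, allocated_bytes) for path."""
    st = os.stat(path)
    # Assumption: st_blocks is counted in 512-byte units (true on Linux),
    # regardless of the filesystem's own block size.
    return st.st_size, st.st_blocks * 512

# Create a 64 MiB file without writing any data to it.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.truncate(64 * 1024 * 1024)  # extend the file; no bytes are written
    sparse_path = tmp.name

logical, allocated = file_sizes(sparse_path)
print(f"logical size: {logical} bytes, allocated on disk: {allocated} bytes")
os.remove(sparse_path)
```

On a filesystem that supports sparse files, the allocated figure should be far smaller than the logical one; on a filesystem without that support the two will be roughly equal.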

Edit:

When you run:

$ dd if=/dev/zero of=output bs=1G count=4

This command copies four 1 GiB blocks of null bytes (4 GiB in total) to output. To see the result:

$ stat output
  File: output
  Size: 4294967296      Blocks: 8388616    IO Block: 4096   regular file
--omitted--

You can see that this file has 8388616 blocks allocated to it. These blocks store nothing but zero bytes copied from /dev/zero, and they do occupy physical disk space: the zeros are physically written to disk rather than represented as holes. dd did what you asked for, copying blocks of data from one file to another.
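As a quick check of the numbers in that stat output, the arithmetic can be worked through in Python. The one assumption here is that the "Blocks" field counts 512-byte units, which is what stat reports on Linux; the variable names are just for illustration.

```python
BLOCK_UNIT = 512           # assumed size of one unit in stat's "Blocks" field
blocks = 8388616           # "Blocks" value from the stat output
logical_size = 4294967296  # "Size" value from the stat output

allocated = blocks * BLOCK_UNIT
print(allocated)                 # bytes actually allocated on disk
print(allocated - logical_size)  # allocation beyond the logical size
print(logical_size // 2**30)     # logical size in GiB
```

So the file occupies the full 4 GiB on disk, plus a few extra KiB whose origin depends on the filesystem.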

Now, run this command to detect the runs of zeros and turn them into holes, making the file sparse in place:

$ fallocate -d output
$ stat output
  File: output
  Size: 4294967296      Blocks: 0          IO Block: 4096   regular file
--omitted--

Do you notice something? The number of blocks is now 0, because the blocks that stored only zero bytes were de-allocated. Remember, output's blocks held nothing but zeros. fallocate -d detected the blocks that contain only zeros and de-allocated them; since every block of this file contained zeros, they were all de-allocated.

Also notice how the size remained the same. This is the logical (virtual) size of the file, not its size on disk. It's crucial to know that output doesn't occupy physical storage space now: it has 0 blocks allocated to it and thus it doesn't really use disk space. The size is preserved after running fallocate -d, so when you later read from the file, the filesystem generates the zero bytes for you at runtime. The physical size of output, however, is zero; it uses no data blocks.

So when a process reads output, the filesystem has to generate all 4 GiB of zero bytes dynamically at runtime; none of them come from the disk. The size reported by stat is the logical size, while the physical size of output is zero.
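To illustrate the idea of turning stored zeros into holes, here is a rough Python sketch. Note that it is not what fallocate -d does: fallocate punches holes in place, whereas this sketch uses a different technique, copying the file and seeking over all-zero blocks in the copy. The function name copy_sparse and the 4096-byte block size are assumptions made for the example.

```python
import os
import tempfile

BLOCK = 4096  # assumed granularity; the real filesystem block size may differ

def copy_sparse(src, dst, block=BLOCK):
    """Copy src to dst, seeking over all-zero blocks instead of writing them."""
    zero = bytes(block)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(block)
            if not chunk:
                break
            if chunk == zero[:len(chunk)]:
                fout.seek(len(chunk), os.SEEK_CUR)  # skip: leaves a hole
            else:
                fout.write(chunk)
        # If the source ends in zeros nothing was written there, so set the
        # final length explicitly.
        fout.truncate(fin.tell())

# Demo: some data, a long run of zeros, more data, and a trailing run of zeros.
payload = b"head" + bytes(256 * BLOCK) + b"tail" + bytes(2 * BLOCK)
workdir = tempfile.mkdtemp()
dense = os.path.join(workdir, "dense")
sparse = os.path.join(workdir, "sparse")
with open(dense, "wb") as f:
    f.write(payload)
copy_sparse(dense, sparse)

with open(sparse, "rb") as f:
    restored = f.read()
print(restored == payload)
print(os.stat(dense).st_blocks, os.stat(sparse).st_blocks)
```

Both files read back identically; on a filesystem with sparse-file support the second st_blocks value should be the smaller one.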

To generate a sparse file using dd:

$ dd if=/dev/zero of=output2 bs=1G seek=4 count=0
$ stat output2
  File: output2
  Size: 4294967296      Blocks: 0          IO Block: 4096   regular file

GNU dd internally uses lseek and ftruncate, so check truncate(2) and lseek(2).
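The same system calls are reachable from Python's os module, so the effect can be sketched without dd. This is only an illustration under the assumption of a Linux system; the 16 MiB size is arbitrary and kept small on purpose.

```python
import os
import tempfile

SIZE = 16 * 1024 * 1024  # 16 MiB, kept small for the example

fd, path = tempfile.mkstemp()
try:
    os.ftruncate(fd, SIZE)        # set the length; no data is written
    os.lseek(fd, 0, os.SEEK_SET)  # go back to the start of the file
    first = os.read(fd, 4096)     # the filesystem fills the buffer with zeros
    st = os.fstat(fd)
finally:
    os.close(fd)
    os.remove(path)

# Logical size, allocated bytes (assuming 512-byte units), and whether the
# bytes read from the hole are all zero.
print(st.st_size, st.st_blocks * 512, first == bytes(4096))
```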


answered Jan 06 '23 23:01

direprobs

A sparse file is a file that is mostly empty, i.e. it contains large blocks of bytes whose value is 0 (zero).

On the disk, the content of a file is stored in blocks of fixed size (usually 4 KiB or more). When all the bytes contained in such a block are 0, a file system that implements sparse files does not need to store the block on disk; instead, it keeps the information somewhere in the file meta-data.

Advantages of using sparse files:

empty blocks of data do not occupy disk space; they are not stored as regular blocks of data; instead, their identifiers (which take only a few bytes) are stored in the file meta-data; this saves 4 KiB of disk space (or more) for each empty block;
reading an empty block of data from a sparse file takes almost no time, because no data is read from disk; since the file system knows all the bytes in the block are 0, it just sets all the bytes in the input buffer to 0 and the data is ready; there is no need to access the slow storage device;
creating an empty block in a sparse file takes almost no time; when a program skips over a region (for example by seeking past it or by extending the file) instead of writing to it, the file system only records in the file meta-data that the block is empty; no data is written to the disk. Note that most file systems do not detect zeros that are written explicitly: such blocks are stored like any other data unless a tool such as fallocate -d later converts them into holes.
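One way to observe what the file system has recorded is to ask it where a file's data lives. The sketch below assumes Linux, where lseek supports SEEK_DATA and SEEK_HOLE (exposed in Python as os.SEEK_DATA and os.SEEK_HOLE); the helper name data_extents is made up for this example, and the exact offsets reported depend on the filesystem.

```python
import errno
import os
import tempfile

def data_extents(path):
    """Yield (start, end) byte ranges that the filesystem reports as data."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:  # no more data after offset
                    break
                raise
            # The end of the file counts as a hole, so this always succeeds.
            end = os.lseek(fd, start, os.SEEK_HOLE)
            yield start, end
            offset = end
    finally:
        os.close(fd)

# Demo: write 4 bytes at offset 1 MiB, skipping over everything before it.
fd, path = tempfile.mkstemp()
os.pwrite(fd, b"data", 1024 * 1024)
os.close(fd)
extents = list(data_extents(path))
os.remove(path)
print(extents)
```

On a file system that tracks holes, the first mebibyte should not appear in the output; a file system without that support reports the whole file as a single data range.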

More information about sparse files can be found on the Wikipedia page.

answered Jan 06 '23 21:01

axiac


Luv33preet
