Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data structure and algorithm for representing/allocating free space in a file

I have a file with "holes" in it and want to fill them with data; I also need to be able to free "used" space and make free space.

I was thinking of using a bi-map that maps offset and length. However, I am not sure if that is the best approach if there are really tiny gaps in the file. A bitmap would work but I don't know how that can be easily switched to dynamically for certain regions of space. Perhaps some sort of radix tree is the way to go?

For what it's worth, I am up to speed on modern file system design (ZFS, HFS+, NTFS, XFS, ext...) and I find their solutions woefully inadequate.

My goals are to have pretty good space savings (hence the concern about small fragments). If I didn't care about that, I would just go for two splay trees... One sorted by offset and the other sorted by length with ties broken by offset. Note that this gives you amortized log(n) for all operations with a working set time of log(m)... Pretty darn good... But, as previously mentioned, does not handle issues concerning high fragmentation.

like image 300
Helen Hunt Avatar asked Feb 01 '11 21:02

Helen Hunt


People also ask

What data structure does file system use?

There are two important data structures used to represent information about a virtual file system, the vfs structure and the v-node. Each virtual file system has a vfs structure in memory that describes its type, attributes, and position in the file tree hierarchy.

Which data structure is best for file directory and asked me to design it?

Answer. Answer: A tree structure is the most common directory structure. B+ tree is best for keeping track of directories and files in a computer.

Which data structures is most efficient for in memory text file viewer?

Gap Buffer is a data structure used for editing and storing text in an efficient manner that is being currently edited. It is also similar to an array but a gap is introduced in the array for handling multiple changes at the cursor.

What is file in data structure and algorithm?

A file structure is a combination of representations for data in files. It is also a collection of operations for accessing the data. It enables applications to read, write, and modify data. File structures may also help to find the data that matches certain criteria.


1 Answers

I have shipped commercial software that does just that. In the latest iteration, we ended up sorting blocks of the file into "type" and "index," so you could read or write "the third block of type foo." The file ended up being structured as:

1) File header. Points at master type list. 2) Data. Each block has a header with type, index, logical size, and padded size. 3) Arrays of (offset, size) tuples for each given type. 4) Array of (type, offset, count) that keeps track of the types.

We defined it so that each block was an atomic unit. You started writing a new block, and finished writing that before starting anything else. You could also "set" the contents of a block. Starting a new block always appended at the end of the file, so you could append as much as you wanted without fragmenting the block. "Setting" a block could re-use an empty block.

When you opened the file, we loaded all the indices into RAM. When you flushed or closed a file, we re-wrote each index that changed, at the end of the file, then re-wrote the index index at the end of the file, then updated the header at the front. This means that changes to the file were all atomic -- either you commit to the point where the header is updated, or you don't. (Some systems use two copies of the header 8 kB apart to preserve headers even if a disk sector goes bad; we didn't take it that far)

One of the block "types" was "free block." When re-writing changed indices, and when replacing the contents of a block, the old space on disk was merged into the free list kept in the array of free blocks. Adjacent free blocks were merged into a single bigger block. Free blocks were re-used when you "set content" or for updated type block indices, but not for the index index, which always was written last.

Because the indices were always kept in memory, working with an open file was really fast -- typically just a single read to get the data of a single block (or get a handle to a block for streaming). Opening and closing was a little more complex, as it needed to load and flush the indices. If it becomes a problem, we could load the secondary type index on demand rather than up-front to amortize that cost, but it never was a problem for us.

Top priority for persistent (on disk) storage: Robustness! Do not lose data even if the computer loses power while you're working with the file! Second priority for on-disk storage: Do not do more I/O than necessary! Seeks are expensive. On Flash drives, each individual I/O is expensive, and writes are doubly so. Try to align and batch I/O. Using something like malloc() for on-disk storage is generally not great, because it does too many seeks. This is also a reason I don't like memory mapped files much -- people tend to treat them like RAM, and then the I/O pattern becomes very expensive.

like image 162
Jon Watte Avatar answered Sep 22 '22 18:09

Jon Watte