Data structure and algorithm for representing/allocating free space in a file

I have a file with "holes" in it and want to fill them with data; I also need to be able to free "used" space, turning it back into free space.

I was thinking of using a bi-map that maps offset and length. However, I am not sure that is the best approach if there are really tiny gaps in the file. A bitmap would work, but I don't know how to switch to one dynamically for only certain regions of space. Perhaps some sort of radix tree is the way to go?

For what it's worth, I am up to speed on modern file system design (ZFS, HFS+, NTFS, XFS, ext...) and I find their solutions woefully inadequate.

My goal is to have pretty good space savings (hence the concern about small fragments). If I didn't care about that, I would just go for two splay trees: one sorted by offset and the other sorted by length, with ties broken by offset. Note that this gives you amortized O(log n) for all operations, with a working-set bound of O(log m)... Pretty darn good... But, as previously mentioned, it does not handle the issues that come with high fragmentation.
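To make the two-index idea concrete, here is a minimal sketch. It is not a splay tree: it swaps in plain sorted Python lists with `bisect`, so inserts and removals cost O(n) rather than amortized O(log n), but the ordering of the two views is the same. The class name `FreeExtents` and its methods are hypothetical, and offsets are assumed to be non-negative.

```python
import bisect

class FreeExtents:
    """Free space as (offset, length) extents, kept in two sorted views."""

    def __init__(self):
        self.by_offset = []  # sorted list of (offset, length)
        self.by_length = []  # sorted list of (length, offset); ties broken by offset

    def add(self, offset, length):
        bisect.insort(self.by_offset, (offset, length))
        bisect.insort(self.by_length, (length, offset))

    def _remove(self, offset, length):
        # Both entries are assumed to exist; bisect_left finds their exact slots.
        self.by_offset.pop(bisect.bisect_left(self.by_offset, (offset, length)))
        self.by_length.pop(bisect.bisect_left(self.by_length, (length, offset)))

    def allocate(self, size):
        """Best fit: the smallest extent with length >= size, lowest offset on ties."""
        i = bisect.bisect_left(self.by_length, (size, 0))
        if i == len(self.by_length):
            return None  # no extent is large enough
        length, offset = self.by_length[i]
        self._remove(offset, length)
        if length > size:
            # Hand the unused tail back as a smaller free extent.
            self.add(offset + size, length - size)
        return offset
```

The by-length view answers "which hole fits?", while the by-offset view is what you would consult to find neighbours when merging freed space. Neither view does anything about the per-extent overhead of very small fragments, which is the concern raised above.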

300

asked Feb 01 '11 21:02

Helen Hunt

1 Answer

I have shipped commercial software that does just that. In the latest iteration, we ended up sorting blocks of the file into "type" and "index," so you could read or write "the third block of type foo." The file ended up being structured as:

1) File header. Points at master type list.
2) Data. Each block has a header with type, index, logical size, and padded size.
3) Arrays of (offset, size) tuples for each given type.
4) Array of (type, offset, count) that keeps track of the types.
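As a rough illustration of the layout listed above, the records could be described with Python's `struct` module. The field widths, the little-endian byte order, the 16-byte padding granularity, and the helper names `padded`, `pack_block`, and `unpack_block` are all assumptions made for this sketch; the answer does not specify any of them.

```python
import struct

# Hypothetical field widths; the answer does not give any.
BLOCK_HEADER = struct.Struct("<IIQQ")  # type, index, logical size, padded size
INDEX_ENTRY = struct.Struct("<QQ")     # one (offset, size) tuple in a per-type array
TYPE_ENTRY = struct.Struct("<IQI")     # (type, offset of that type's array, count)
ALIGN = 16                             # assumed padding granularity

def padded(size, align=ALIGN):
    """Round size up to the next multiple of align."""
    return (size + align - 1) // align * align

def pack_block(block_type, index, payload):
    """Serialize one data block: header, payload, then zero padding."""
    pad = padded(len(payload))
    header = BLOCK_HEADER.pack(block_type, index, len(payload), pad)
    return header + payload + b"\0" * (pad - len(payload))

def unpack_block(buf, offset=0):
    """Read a block header at offset and return (type, index, payload, padded size)."""
    block_type, index, logical, pad = BLOCK_HEADER.unpack_from(buf, offset)
    start = offset + BLOCK_HEADER.size
    return block_type, index, buf[start:start + logical], pad
```

Keeping both a logical and a padded size in the header lets a block shrink or grow in place within its padding without touching its index entry.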

We defined it so that each block was an atomic unit. You started writing a new block, and finished writing that before starting anything else. You could also "set" the contents of a block. Starting a new block always appended at the end of the file, so you could append as much as you wanted without fragmenting the block. "Setting" a block could re-use an empty block.

When you opened the file, we loaded all the indices into RAM. When you flushed or closed a file, we re-wrote each index that changed, at the end of the file, then re-wrote the index index (the master type list) at the end of the file, then updated the header at the front. This means that changes to the file were all atomic -- either you commit to the point where the header is updated, or you don't. (Some systems use two copies of the header 8 kB apart to preserve headers even if a disk sector goes bad; we didn't take it that far.)
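The ordering of that flush can be sketched as follows. This is a simplified illustration under stated assumptions, not the shipped implementation: the header layout (`<4sQ`, a magic tag plus the offset of the master index), the `BLKF` magic value, and the function names are hypothetical, and the indices are passed in as already-serialized bytes.

```python
import io
import struct

HEADER = struct.Struct("<4sQ")  # assumed layout: magic, offset of the master index
MAGIC = b"BLKF"                 # hypothetical magic value

def create(f):
    """Write an initial header; offset 0 means "no master index yet"."""
    f.seek(0)
    f.write(HEADER.pack(MAGIC, 0))

def commit(f, changed_indices, master_index):
    """Append changed indices, then the master index, then repoint the header."""
    f.seek(0, io.SEEK_END)
    for data in changed_indices:
        f.write(data)               # 1. changed per-type indices go at the end
    new_master_offset = f.tell()
    f.write(master_index)           # 2. the master index is always written last
    f.flush()                       # on a real file, os.fsync(f.fileno()) belongs here
    f.seek(0)
    f.write(HEADER.pack(MAGIC, new_master_offset))  # 3. the header update is the commit point
    f.flush()
    return new_master_offset

def read_master_offset(f):
    """Return the master index offset recorded in the header."""
    f.seek(0)
    _magic, offset = HEADER.unpack(f.read(HEADER.size))
    return offset
```

Until step 3 lands, the header still points at the previous master index, so a crash during steps 1 or 2 leaves the old state readable. The sketch assumes the small header write itself is not torn; guarding against that is what the duplicated-header scheme mentioned above is for.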

One of the block "types" was "free block." When re-writing changed indices, and when replacing the contents of a block, the old space on disk was merged into the free list kept in the array of free blocks. Adjacent free blocks were merged into a single bigger block. Free blocks were re-used when you "set content" or for updated type block indices, but not for the index index, which always was written last.
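Merging adjacent free blocks might look like the sketch below, assuming the free list is held in memory as a Python list of `(offset, size)` tuples sorted by offset. The function name `free_block` is hypothetical, and the code assumes the freed range never overlaps an existing free block (no double frees).

```python
import bisect

def free_block(free_list, offset, size):
    """Insert (offset, size) into a list sorted by offset, merging neighbours."""
    i = bisect.bisect_left(free_list, (offset, size))
    # Merge with the preceding block if it ends exactly where this one starts.
    if i > 0 and free_list[i - 1][0] + free_list[i - 1][1] == offset:
        i -= 1
        offset, size = free_list[i][0], free_list[i][1] + size
        del free_list[i]
    # Merge with the following block if it starts exactly where this one ends.
    if i < len(free_list) and free_list[i][0] == offset + size:
        size += free_list[i][1]
        del free_list[i]
    free_list.insert(i, (offset, size))
```

For example, freeing `(10, 20)` into a list holding `(0, 10)` and `(30, 10)` bridges both neighbours and leaves the single block `(0, 40)`.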

Because the indices were always kept in memory, working with an open file was really fast -- typically just a single read to get the data of a single block (or get a handle to a block for streaming). Opening and closing was a little more complex, as it needed to load and flush the indices. If that had become a problem, we could have loaded the secondary type index on demand rather than up-front to amortize that cost, but it never was a problem for us.

Top priority for persistent (on disk) storage: Robustness! Do not lose data even if the computer loses power while you're working with the file!

Second priority for on-disk storage: Do not do more I/O than necessary! Seeks are expensive. On Flash drives, each individual I/O is expensive, and writes are doubly so. Try to align and batch I/O.

Using something like malloc() for on-disk storage is generally not great, because it does too many seeks. This is also a reason I don't like memory mapped files much -- people tend to treat them like RAM, and then the I/O pattern becomes very expensive.

162

answered Sep 22 '22 18:09

Jon Watte

Tags: algorithm, filesystems, data-structures
