Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do disk pointers work?

Suppose I want to store a complicated data structure (a tree, say) to disk. The internal pointers which connect nodes in my data structures are pointers, but I can't just write these pointers to disk, because when I read the data structure back the memory locations will have changed.

So what is the right way to store the pointers on disk? Is the answer as simple as (File, Offset), or is there something that I'm missing? I can intuit how pointers might be converted to (File, offset) pairs, and back again, but are there some subtleties that I should watch out for?

Edit: I should mention that I'm especially interested in how a database would do this internally, for a b-tree. I probably made the question more general than I should have, though I do appreciate the XML-based answers.

like image 973
Rob Lachlan Avatar asked Jan 10 '10 17:01

Rob Lachlan


People also ask

What is the use of disk pointer?

Accessing nodes of a binary search tree stored on disks using (file, offset) pointer would be orders of magnitude slower than accessing them in memory. If speed of access is important, you'd want to store things which are expected to accessed together, closer together on disks.

Where are file pointers stored?

The structure is in memory, and the pointer points at the structure in memory.

What does a file pointer contain?

Answer: File pointer is a pointer which is used to handle and keep track on the files being accessed. A new data type called “FILE” is used to declare file pointer. This data type is defined in stdio.


2 Answers

Your intutuion about (file, offset) pairs is correct.

An important thing to watch out for when storing data on disks is that, disks are slow. So, there are special data structures which have been designed to store "searchable" data on disks. Accessing nodes of a binary search tree stored on disks using (file, offset) pointer would be orders of magnitude slower than accessing them in memory.

If speed of access is important, you'd want to store things which are expected to accessed together, closer together on disks. A couple of data structures used for this are B-tree and B+ tree. Look these up, to find out how to use them. There are complicated caching algorithms used by several applications such as databases, to cache things in memory, so that apps do not need to go to disk to retrieve stuff again and again.

If speed of access is not important, then simply "serializing" data on disk in the form of XML as suggested by Aiden and Darren is good enough.

Edit: If you need more details about how databases store data on disk, you'd need to learn more about database theory. I'd suggest reading up a good book on databases, so that you understand the requirements that drive the disk format. Note that I am mostly referring to relational databases here, but there are other breeds of databases, which have completely different requirements and hence different disk formats. Starting with relational databases is a good thing to do though, since they are most commonly used.

In short a few things that affect relational database disk format are:

  1. Disk read/write performance
  2. Database recovery (in case of corruption)
  3. Relations between entities
  4. Garbage collection
  5. Transactional support
  6. Primary index

Query optimization is an important branch of database theory to optimize disk accesses, for satisfying a query. Hopefully, this will get you started in the right direction.

like image 124
Sudhanshu Avatar answered Sep 27 '22 16:09

Sudhanshu


Anyway you like. You could store it as references to other files on-top of a filesystem for each node, or write a filesystem driver that uses block references.

Providing:

  1. Your nodes contain references to locations that persist
  2. You can know when writing a node what locations to write

You can do it any way you wish. Filesystems are trees that use a disk-based inode system.

You could always use a single file with a header and use byte-offsets stored as unsigned ints or values that map onto ints. inside the file to denote the start of some node ... then have an end-of-record at the end of each node.

You could also use XML files with references to other locations or a single file and XPath/XPointers.

<Node id="someNode">
    <value>...</value>
    <children>
        <child xpath="/node[id=1]" />
        <child xpath="/node[id=29]" />

But this would mean serializing your values into characters if they are just binary blobs (eww) Your value could be a path of a binary chunk just written to a file such as:

<value>/path/to/mappable.bin</value>

Check out anything from XML encapsulation through to filesystems written in C for a whole gamut of tree implementations.

This XML solution might be bloated, but is simple enough if you don't need speed. Just an example of a high-level approach. Tree storage is an age-old problem, with solutions at all levels.

Trees is trees.

like image 44
Aiden Bell Avatar answered Sep 27 '22 16:09

Aiden Bell