Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Are recursive types really the only way to build noncontinuous arbitrary-size data structures?

I just noticed a question asking what recursive data types ("self-referential types") would be good for in C++ and I was tempted to boldly claim

It's the only way to construct data structures (more precisely containers) that can accept arbitrary large data collections without using continuous memory areas.

That is, if you had no random-access arrays, you would require some means of references (logically) to a type within that type (obviously, instead of having a MyClass* next member you could say void* next but that would still point to a MyClass object or a derived type).

However, I am careful with absolute statements -- just because I couldn't think of something doesn't mean it's not possible, so am I overlooking something? Are there data structures that are neither organised using mechanisms similar to linked lists / trees nor using continuous sequences exclusively?


Note: This is tagged both c++ and language-agnostic as I'd be interested specifically in the C++ language but also in theoretical aspects.

like image 738
bitmask Avatar asked Aug 24 '12 16:08

bitmask


3 Answers

It's the only way to construct data structures (more precisely containers) that can accept arbitrary large data collections without using continuous memory areas.

After contemplating for a while, this statement seems be correct. It is self-evident, in fact.

Suppose I've a collection of elements in a non-contiguous memory. Also suppose that I'm currently at element e. Now the question is, how would I know the next element in the collection? Is there any way?

Given an element e from a collection, there are only two ways to compute the location of the next element:

  • If I assume that it is at offset sizeof(e) irrespective of what e is, then it means that the next element starts where the current element ends. But then this implies that the collection is in a contiguous memory, which is forbidden in this discussion.

  • The element e itself tells us the location of the next element. It may store the address itself, or an offset. Either way, it is using the concept of self-reference, which too is forbidden in this discussion.

As I see it, the underlying idea of both of these approaches is exactly same: they both implement self-reference. The only difference is that in the former, the self-reference is implemented implicitly, using sizeof(e) as offset. This implicit self-reference is supported by the language itself, and implemented by the compiler. In the latter, it is explicit, everything is done by the programmer himself, as now the offset (or pointer) is stored in the element itself.

Hence, I don't see any third approach to implement self-reference. If not self-reference, then what terminology would one use to describe the computation of the location of the next element to an element e.

So my conclusion is, your statement is absolutely correct.

like image 177
Nawaz Avatar answered Nov 20 '22 18:11

Nawaz


The problem is that the dynamic allocator itself is managing contiguous storage. Think about the "tape" used for a Turing Machine, or the Von Neumann architecture. So to seriously consider the problem, you would likely need to develop a new computing model and new computer architecture.

If you think disregarding the contiguous memory of the underlying machine is okay, I am sure a number of solutions are possible. The first that comes to my mind is that each node of the container is marked with an identifier that has no relation to its position in memory. Then, to find the associated node, all of memory is scanned until the identifier is found. This isn't even particularly inefficient if given enough computing elements in a parallel machine.

like image 42
jxh Avatar answered Nov 20 '22 17:11

jxh


Here's a sketch of a proof.

Given that a program must be of finite size, all types defined within the program must contain only finitely many members and reference only finitely many other types. The same holds for any program entrypoint and for any objects defined before program initialisation.

In the absence of contiguous arrays (which are the product of a type with a runtime natural number and are therefore unconstrained in size), all types must be arrived at through the composition of types as above; derivation of types (pointer-to-pointer-to-A) is still constrained by the size of the program. There are no facilities other than contiguous arrays to compose a runtime value with a type.

This is a little contentious; if e.g. mappings are considered primitive then one can approximate an array with a map whose keys are the natural numbers. Of course, any implementation of a map must use self-referential data structures (B-trees) or contiguous arrays (hash tables).

Next, if the types are non-recursive then any chain of types (A references B references C...) must terminate, and can be of no greater length than the number of types defined in the program. Thus the total size of data referenceable by the program is limited to the product of the sizes of each type multiplied by the number of names defined in the program (in its entrypoint and static data).

This holds even if functions are recursive (which strictly speaking breaks the prohibition on recursive types, since functions are types); the amount of data immediately visible at any one point in the program is still limited to the product of the sizes of each type multiplied by the number of names visible at that point.

An exception to this is if you store a "container" in a stack of recursive function calls; however such a program would not be able to traverse its data at random without unwinding the stack and having to reread data, which is something of a disqualification.

Finally, if it is possible to create types dynamically the above proof does not hold; we could for example create a Lisp-style list structure where each cell is of a distinct type: cons<4>('h', cons<3>('e', cons<2>('l', cons<1>('l', cons<0>('o', nil))))). This is not possible in most static-typed languages, although it is possible in some dynamic languages e.g. Python.

like image 1
ecatmur Avatar answered Nov 20 '22 19:11

ecatmur