I have a project in which I need fast search, insert, and delete operations on data ranging from megabytes to terabytes. I have been studying data structures lately and analyzing them. To be specific, I want to introduce three cases and ask questions about them:
1. The data is much more than the memory can handle at one go (sample range: 10-15 terabytes). In this case, I would store the data structure on disk.
2. The data is relatively small compared to the system's memory, so it can be stored and operated on in memory itself for speed.
3. The data is more than the free memory, but assume it is less than the size of a possible contiguous chunk in the paging file. In this case, I store the data structure in a file on disk and memory-map the file.
The conclusions I have drawn are:
For case 1, I should use a B-Tree for faster access, as it saves on the latency caused by disk seeks and rotation (a rough calculation is sketched after this list).
For case 2, I should use a Red-Black Tree for faster access, as the data is in memory and the number of elements that would have to be scanned in the worst case is smaller than with a B-Tree.
For case 3, I am doubtful: the paging file is on disk and the OS uses native I/O to operate on it, so would a B-Tree or a Red-Black Tree be the better option?
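To show why I lean toward a B-Tree in case 1, here is a rough back-of-the-envelope sketch; the 4 KB page size, 8-byte keys/pointers, and the key count are only illustrative assumptions:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Illustrative assumptions: 4 KB disk pages, 8-byte keys, 8-byte child pointers.
    const double page_bytes  = 4096.0;
    const double entry_bytes = 8.0 + 8.0;                 // key + child pointer
    const double fanout      = page_bytes / entry_bytes;  // ~256 children per node
    const double n           = 1e12;                      // ~a trillion keys (terabyte scale)

    // Tree height is roughly the number of disk reads per lookup.
    const double btree_reads = std::log(n) / std::log(fanout); // ~5
    const double bst_reads   = std::log(n) / std::log(2.0);    // ~40

    std::printf("B-tree:      ~%.1f disk reads per lookup\n", btree_reads);
    std::printf("Binary tree: ~%.1f disk reads per lookup\n", bst_reads);
}
```

With those assumed numbers, a lookup touches around 5 pages in a B-tree versus around 40 nodes in a binary tree, which is why the B-tree looks attractive when every node access is a disk read.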
I want to know where the above three conclusions are right, where they go wrong, and how I can improve performance in each of the three cases.
I am using C++, with a Red-Black tree and a B-Tree that I have both written from scratch. I am using the Boost library for file mapping.
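For reference, a minimal sketch of the kind of mapping I mean, using Boost.Interprocess (the file name `data.bin` is just a placeholder and the file is assumed to already exist):

```cpp
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <cstddef>

int main() {
    using namespace boost::interprocess;

    // Map an existing file into this process's virtual address space.
    file_mapping mapping("data.bin", read_write);
    mapped_region region(mapping, read_write);

    // The tree nodes can then be laid out inside this contiguous region.
    void*       addr = region.get_address();
    std::size_t size = region.get_size();

    (void)addr;
    (void)size;
}
```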
Update 1: I was reading through this post on Stack Overflow and got some really good insights, which make me feel that the kind of comparisons I have done in the cases above may be faulty. A link was posted in the most-voted answer: http://idlebox.net/2007/stx-btree/stx-btree-0.8.3/doxygen-html/speedtest.html
B-Tree nodes hold more than one element. All leaves of a B-Tree are at the same depth; Red-Black tree leaves have the same "black" depth.
A B-tree should be able to outperform a red-black tree if you're doing it right (in particular, a B-tree should be faster if a node fits into a cache line). It doesn't need to be contiguous in the page file; it merely needs to be contiguous in the process's virtual address space.
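As an illustration of the cache-line point, a node can be sized to fit in a single 64-byte line; the 64-byte line size and the field layout below are assumptions, not anything from the poster's implementation:

```cpp
#include <cstdint>

// A small B-tree node sized to fit one 64-byte cache line (typical on x86-64).
// Using indices into a node array instead of raw pointers keeps the whole tree
// contiguous in virtual address space, which also suits a memory-mapped file.
struct CacheLineNode {
    std::int32_t  keys[7];      // 28 bytes of keys
    std::uint32_t children[8];  // 32 bytes: child slots as array indices
    std::uint16_t count;        // number of keys currently in use
    std::uint16_t is_leaf;      // leaf flag (also fills the struct to 64 bytes)
};

static_assert(sizeof(CacheLineNode) == 64, "node should occupy exactly one cache line");
```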
Red-Black tree operations take O(log n) time, where n is the total number of elements in the tree. Comparison with AVL trees: AVL trees are more strictly balanced than Red-Black trees, but they may require more rotations during insertion and deletion. So if your application involves frequent insertions and deletions, Red-Black trees should be preferred.
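For the in-memory case, note that std::map in most standard library implementations is a red-black tree, so it makes a convenient baseline to compare a hand-written tree against:

```cpp
#include <cstdio>
#include <map>

int main() {
    // std::map is typically implemented as a red-black tree,
    // giving O(log n) insert, find and erase.
    std::map<int, int> tree;

    tree.insert({42, 1});        // O(log n) insert
    auto it = tree.find(42);     // O(log n) search
    if (it != tree.end())
        std::printf("found key %d\n", it->first);
    tree.erase(42);              // O(log n) delete
}
```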
Red-Black trees are very similar to a standard BST; they just add a colour field per node and a few extra operations. The coloured nodes allow the data structure to self-balance.
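In code, those "few extra lines" amount to little more than a colour field on an otherwise ordinary BST node (a sketch, not the poster's actual implementation):

```cpp
// Sketch of a red-black tree node: an ordinary BST node plus a colour bit.
enum class Colour { Red, Black };

template <typename Key>
struct RBNode {
    Key     key;
    Colour  colour = Colour::Red;  // new nodes are inserted red, then rebalanced
    RBNode* left   = nullptr;
    RBNode* right  = nullptr;
    RBNode* parent = nullptr;      // parent pointer simplifies rotations and fix-up
};
```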
A red/black tree is more or less equivalent to a 2-3-4 tree, which is just a type of B-tree. The worst-case performance is identical, provided you do a binary search of the B-tree node values.
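The "binary search of the node values" part can be as simple as std::lower_bound over the node's sorted key array; `Node` below is a hypothetical type, not the poster's class:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical B-tree node with its keys kept sorted.
struct Node {
    static const int kMaxKeys = 32;
    std::int64_t keys[kMaxKeys];
    int count = 0;   // number of keys in use
};

// Index of the first key >= target: either the matching key position or the
// child slot to descend into. O(log kMaxKeys) comparisons per node, so the
// worst-case total matches a balanced binary search tree.
inline int lower_bound_in_node(const Node& node, std::int64_t target) {
    const std::int64_t* begin = node.keys;
    const std::int64_t* end   = node.keys + node.count;
    return static_cast<int>(std::lower_bound(begin, end, target) - begin);
}
```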
The obvious disadvantage of a B-tree is wasted space, but depending on the language/memory allocator used, you may find that a 2-3-4 tree uses less space than a red-black tree on average. For instance, in 32-bit Java, there's approximately an 8-byte overhead per object. (It also depends a lot on the allocator; IIRC phkmalloc rounds up small allocations to a power-of-2 size.)
To answer your cases: for simplicity, I'd go with a B-tree and run some benchmarks on various node sizes.
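A minimal timing harness for such a benchmark might look like the sketch below; std::map stands in for the red-black tree, and the commented-out `BTree` class with its node-size parameter is hypothetical:

```cpp
#include <chrono>
#include <cstdio>
#include <map>
#include <random>
#include <vector>

// Time inserting and then looking up the same set of random keys in any
// container exposing map-like insert/find/end.
template <typename Tree>
double benchmark(Tree& tree, const std::vector<int>& keys) {
    auto start = std::chrono::steady_clock::now();
    for (int k : keys) tree.insert({k, k});
    long found = 0;
    for (int k : keys) found += (tree.find(k) != tree.end());
    auto stop = std::chrono::steady_clock::now();
    std::printf("found %ld keys\n", found);
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    std::mt19937 rng(12345);
    std::vector<int> keys(1000000);
    for (int& k : keys) k = static_cast<int>(rng() & 0x7FFFFFFF);

    std::map<int, int> rb_tree;  // stands in for the hand-written red-black tree
    std::printf("red-black tree: %.1f ms\n", benchmark(rb_tree, keys));

    // A hand-written B-tree with the same interface could be benchmarked
    // identically, once per candidate node size:
    //   BTree<int, int, /*NodeSize=*/64> btree;   // hypothetical class
    //   std::printf("B-tree (64): %.1f ms\n", benchmark(btree, keys));
}
```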