I have a project in which I need fast search, insert, and delete operations on data ranging from megabytes to terabytes. I have been studying data structures lately and analyzing them. To be specific, I want to introduce three cases and ask questions about them:
1. The data is much more than the memory can handle at one go (sample range: 10-15 terabytes). In this case, I would store the data structure on disk.
2. The data is relatively small compared to the system's memory, so it can be stored and operated on in memory itself for speed.
3. The data is more than the free memory, but assume it is less than the size of a possible contiguous chunk in the paging file. In this case, I store the data structure in a file on disk and memory-map the file.
The conclusions I have drawn are:
For case 1, I should use a B-Tree for faster access, as it saves on the latency caused by disk seeks and rotation (a rough calculation is sketched after this list).
For case 2, I should use a Red-Black Tree for faster access, as the data is in memory and the number of elements that would have to be scanned in the worst case is smaller than with a B-Tree.
For case 3, I am doubtful: the paging file is on disk and the OS uses native I/O to operate on it, so would a B-Tree or a Red-Black Tree be the better option?
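To show why I lean toward a B-Tree in case 1, here is a rough back-of-the-envelope sketch; the 4 KB page size, 8-byte keys/pointers, and the key count are only illustrative assumptions:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Illustrative assumptions: 4 KB disk pages, 8-byte keys, 8-byte child pointers.
    const double page_bytes  = 4096.0;
    const double entry_bytes = 8.0 + 8.0;                 // key + child pointer
    const double fanout      = page_bytes / entry_bytes;  // ~256 children per node
    const double n           = 1e12;                      // ~a trillion keys (terabyte scale)

    // Tree height is roughly the number of disk reads per lookup.
    const double btree_reads = std::log(n) / std::log(fanout); // ~5
    const double bst_reads   = std::log(n) / std::log(2.0);    // ~40

    std::printf("B-tree:      ~%.1f disk reads per lookup\n", btree_reads);
    std::printf("Binary tree: ~%.1f disk reads per lookup\n", bst_reads);
}
```

With those assumed numbers, a lookup touches around 5 pages in a B-tree versus around 40 nodes in a binary tree, which is why the B-tree looks attractive when every node access is a disk read.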
I want to know where the above three conclusions are right, where they go wrong, and how I can improve performance in each of the three cases.
I am using C++, with a Red-Black tree and a B-Tree that I have both written from scratch. I am using the Boost library for file mapping.
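For reference, a minimal sketch of the kind of mapping I mean, using Boost.Interprocess (the file name `data.bin` is just a placeholder and the file is assumed to already exist):

```cpp
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <cstddef>

int main() {
    using namespace boost::interprocess;

    // Map an existing file into this process's virtual address space.
    file_mapping mapping("data.bin", read_write);
    mapped_region region(mapping, read_write);

    // The tree nodes can then be laid out inside this contiguous region.
    void*       addr = region.get_address();
    std::size_t size = region.get_size();

    (void)addr;
    (void)size;
}
```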
Update 1: I was reading through this post on Stack Overflow and got some really good insights, which make me feel that the kind of comparisons I have done in the cases above may be faulty. A link was posted in the most-voted answer: http://idlebox.net/2007/stx-btree/stx-btree-0.8.3/doxygen-html/speedtest.html
B-Tree nodes hold more than one element. All leaves of a B-Tree are at the same depth; Red-Black tree leaves have the same "black" depth.
A B-tree should be able to outperform a red-black tree if you're doing it right (in particular, a B-tree should be faster if a node fits into a cache line). It doesn't need to be contiguous in the page file; it merely needs to be contiguous in the process's virtual address space.
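As an illustration of the cache-line point, a node can be sized to fit in a single 64-byte line; the 64-byte line size and the field layout below are assumptions, not anything from the poster's implementation:

```cpp
#include <cstdint>

// A small B-tree node sized to fit one 64-byte cache line (typical on x86-64).
// Using indices into a node array instead of raw pointers keeps the whole tree
// contiguous in virtual address space, which also suits a memory-mapped file.
struct CacheLineNode {
    std::int32_t  keys[7];      // 28 bytes of keys
    std::uint32_t children[8];  // 32 bytes: child slots as array indices
    std::uint16_t count;        // number of keys currently in use
    std::uint16_t is_leaf;      // leaf flag (also fills the struct to 64 bytes)
};

static_assert(sizeof(CacheLineNode) == 64, "node should occupy exactly one cache line");
```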
Red-Black tree operations take O(log n) time, where n is the total number of elements in the tree. Comparison with AVL trees: AVL trees are more strictly balanced than Red-Black trees, but they may require more rotations during insertion and deletion. So if your application involves frequent insertions and deletions, Red-Black trees should be preferred.
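For the in-memory case, note that std::map in most standard library implementations is a red-black tree, so it makes a convenient baseline to compare a hand-written tree against:

```cpp
#include <cstdio>
#include <map>

int main() {
    // std::map is typically implemented as a red-black tree,
    // giving O(log n) insert, find and erase.
    std::map<int, int> tree;

    tree.insert({42, 1});        // O(log n) insert
    auto it = tree.find(42);     // O(log n) search
    if (it != tree.end())
        std::printf("found key %d\n", it->first);
    tree.erase(42);              // O(log n) delete
}
```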
Red-Black trees are very similar to a standard BST; they just add a colour field per node and a few extra operations. The coloured nodes allow the data structure to self-balance.
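In code, those "few extra lines" amount to little more than a colour field on an otherwise ordinary BST node (a sketch, not the poster's actual implementation):

```cpp
// Sketch of a red-black tree node: an ordinary BST node plus a colour bit.
enum class Colour { Red, Black };

template <typename Key>
struct RBNode {
    Key     key;
    Colour  colour = Colour::Red;  // new nodes are inserted red, then rebalanced
    RBNode* left   = nullptr;
    RBNode* right  = nullptr;
    RBNode* parent = nullptr;      // parent pointer simplifies rotations and fix-up
};
```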
A red/black tree is more or less equivalent to a 2-3-4 tree, which is just a type of B-tree. The worst-case performance is identical, provided you do a binary search of the B-tree node values.
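The "binary search of the node values" part can be as simple as std::lower_bound over the node's sorted key array; `Node` below is a hypothetical type, not the poster's class:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical B-tree node with its keys kept sorted.
struct Node {
    static const int kMaxKeys = 32;
    std::int64_t keys[kMaxKeys];
    int count = 0;   // number of keys in use
};

// Index of the first key >= target: either the matching key position or the
// child slot to descend into. O(log kMaxKeys) comparisons per node, so the
// worst-case total matches a balanced binary search tree.
inline int lower_bound_in_node(const Node& node, std::int64_t target) {
    const std::int64_t* begin = node.keys;
    const std::int64_t* end   = node.keys + node.count;
    return static_cast<int>(std::lower_bound(begin, end, target) - begin);
}
```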
The obvious disadvantage of a B-tree is wasted space, but depending on the language/memory allocator used, you may find that a 2-3-4 tree uses less space than a red-black tree on average. For instance, in 32-bit Java, there's approximately an 8-byte overhead per object. (It also depends a lot on the allocator; IIRC phkmalloc rounds up small allocations to a power-of-2 size.)
To answer your cases: for simplicity, I'd go with a B-tree and run some benchmarks on various node sizes.
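A minimal timing harness for such a benchmark might look like the sketch below; std::map stands in for the red-black tree, and the commented-out `BTree` class with its node-size parameter is hypothetical:

```cpp
#include <chrono>
#include <cstdio>
#include <map>
#include <random>
#include <vector>

// Time inserting and then looking up the same set of random keys in any
// container exposing map-like insert/find/end.
template <typename Tree>
double benchmark(Tree& tree, const std::vector<int>& keys) {
    auto start = std::chrono::steady_clock::now();
    for (int k : keys) tree.insert({k, k});
    long found = 0;
    for (int k : keys) found += (tree.find(k) != tree.end());
    auto stop = std::chrono::steady_clock::now();
    std::printf("found %ld keys\n", found);
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    std::mt19937 rng(12345);
    std::vector<int> keys(1000000);
    for (int& k : keys) k = static_cast<int>(rng() & 0x7FFFFFFF);

    std::map<int, int> rb_tree;  // stands in for the hand-written red-black tree
    std::printf("red-black tree: %.1f ms\n", benchmark(rb_tree, keys));

    // A hand-written B-tree with the same interface could be benchmarked
    // identically, once per candidate node size:
    //   BTree<int, int, /*NodeSize=*/64> btree;   // hypothetical class
    //   std::printf("B-tree (64): %.1f ms\n", benchmark(btree, keys));
}
```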