This post is very delayed, but things have been busy.
I'm working my way up to finger trees, which are a wonderful
functional data structure. They're based on trees - and like many
tree-based structures, their performance relies heavily on the balance
of the tree. The more balanced the tree is, the better they perform.
In general, my preference when I need a balanced tree is a red-black
tree, which I wrote about before. But there are other kinds of balanced trees,
and for finger trees, many people prefer a kind of tree called a 2/3 tree.
A two-three tree is basically a B-tree with a maximum of two values per node.
If you don't know what a B-tree is, don't worry. I'm going to explain all the details.
A two-three tree is a tree where each node contains either one or two values. If
the node isn't a leaf, then its number of children is one more than its number
of values. The basic value constraints are the same as in a binary search tree:
smaller values to the left, larger values to the right.
The big difference in the 2/3 tree is balance: it's got a much stronger
balance requirement. In a 2/3 tree, every path from the tree root to the leaves
is exactly the same length.
Before getting into the insert algorithm (which is where the beauty and
the trickiness of the structure resides), let's get a bit more formal about what a two-three tree is. The rules of a two-three
- Every node has either one or two values.
- Every node has either 0, 2, or 3 children.
- For a given node n, the 1st child subtree contains only values
less than the 1st value of n; the 2nd child subtree contains values
greater than or equal to the 1st value of n but less than the second;
and the 3rd child subtree contains values greater than or equal to
the second value in n.
- The leaves of the tree are all on the same level.
How can it do that? The concept is pretty simple, but the implementation is
tricky. (B-trees in general are incredibly error-prone to implement. There's a lot
of subtleties to get the implementation correct. They're not difficult, but there
are a lot of tricky cases.)
To insert a node, what you do is a straightforward search for the correct leaf to
insert it into. Then you insert it into the correct position in that leaf. If after
doing that, the leaf has three values, then you take that node, and split it. You take
the middle value, and make a new single-value tree node. Then you take the smallest value
in the node, make it into a single-value node, and make it the left child of the middle node. Similarly, with the largest value, you make a new single value node, and make it the right child of the middle node. That gives you a simple three-node two-three tree where you
used to have a single node.
Now you need to re-attach it to the tree. To do that, you go to the parent of the node where you did the insert. You remove the pointer that used to point to the node where you did the insert, and then you splice in the new node at that position. If that results in
the parent node having more than two values, then you split the parent, and continue.
As usual, this is much clearer with an example. Here's a two-three tree.
Now, suppose we want want to insert the value 15. The insert node is going to be the third child - the rightmost node in the tree, which has two values, 14 and 17. So we go ahead and insert it into that node.
Now we've got a leaf node with three values, 14, 15, and 17. That's no good. So we split
it out. We make a new node with the value 15, and make it a left-child node containing 14, and a right child containing 17. Then we insert that into the parent.
So now we've got a root node with three values: 6, 12, and 15. So we need to split it.
We do the same basic thing: take the middle value (12), and make it a new node. Then we
make 6 its left child, and 15 its right. And now we've got a two-three tree that satisfies all the rules, and is beautifully balanced.
Deleting a node uses similar tricks, but based on merging nodes instead of splitting.
The trick in deletes is making sure that you don't lose levels. For example, suppose
you wanted to remove "6".
Looking at our tree, if you deleted "6", the easiest way would be to merge 4 and 7 into a single node, which was the left child of 12, as in this diagram.
But when you do that, you've removed a level from the leftmost branch of the tree: the left branch of the root node would have a depth of two, and the right branch would have a depth of three. So to do the delete successfully, you need to shuffle things around a bit. Since you're removing a level from the left subtree, you need to also remove one from the right. You do that by merging the root of the right subtree into the tree root:
Finally, if the merge of the right branch had caused the root to contain three values, we would have needed to do a split, exactly the way we did in insert.
It's a beautiful structure. But as I said, implementing it is a pain. Personally,
I'd rather take the relatively small performance penalty of the red-black tree,
which I find a lot easier to get right. In general, two-three trees are beautiful in concept,
but more trouble than they're worth in practice. But at least some functional folks love
them; and I suppose it's possible that in a functional language with strong formal
properties, it's easier to get them right than in a typical imperative language.
I'm not going to show you an implementation of two-three trees - as I said, the implementation gets rather hairy, covering all of the cases. I put one together, but looking at it, I don't think it's got much in the way of explanatory value. In the next data structure post, I'm going to finally talk about finger trees. One of the canonical implementations of them is based on a Haskell implementation of two-three trees - so if you really want to
see that, there'll be a link in the next post.
Funny, I JUST learned both red black tree and 2-3 trees this week in data structures course in uni.
I'm surprised you say red black trees are simpler than 2-3 trees... I have not tried implementing either, but 2-3 trees seem incredibly simpler to me, and have far far less cases to deal with.. The red-black tree we learned in class had 3 different cases of behavior in insert, involving uncles and other complex relationships, and 4 different cases for removing.. each case being actual different behaviors, not just different positioning... (meaning, each of the 7 cases also had subcases of left/right...)
The 2-3 tree afterward seemed astoundingly simpler to me.. But perhaps I am underestimating, as I have not tried implementing either, and in implementation it could come out much more complicated... But the theory alone of the 2-3 trees is so much simpler than it seems than in any way it should be easier to implement than red-black trees...
B-trees have always been one of my favorite structures. Always balanced and fast, great paging behavior. Complicated but rewarding. When I first discovered them it was love at first sight.
Wikipedia has a pretty good description. http://en.wikipedia.org/wiki/B-tree
Thanks for the post. I've never heard about 2-3 tree before I read about them here. As Oded pointed out it looks simple, but when I started implemented it (in Java), I understood the pain you talked about. But, after a bit of suffering I got it to work. I uploaded my implementation to http://code.google.com/p/twothreetree
I also used this http://cs.wellesley.edu/~cs230/spring07/2-3-trees.pdf document for reference.
I am looking forward to post about finger trees.
Thanks for the post, Mark. I appreciate you taking the time to provide detailed illustrations.
Does anyone have performance data comparing 2-3 trees with other tree algorithms? I would think they perform better in lookup-intensive workloads given the smaller mean depth. (Hint, hint, to any CS students looking for projects.)
I haven't done any experiments myself, but I vaguely remember reading a paper with a comparison of various different search trees. 2/3 trees did perform very slightly better than red-black trees in search, but not as good as AVL trees. But with insert added into the mix, the costs balanced out so that AVL trees were overall the worst, with red/black and 2/3 being roughly equivalent.
Like I said, I don't remember the paper particularly well. I might try looking for it in my files, and then writing something about it. But my intuition is that the 2/3 trees are shallower than the red/blacks, but searching a node in a red-black can require 2 comparisons instead of 1 in the red/black. I think that the reduction in depth of the 2/3 will roughly balance with the increased comparisons due to multiple values in the internal nodes.
Of course, that's just intuition, which isn't really worth the bits it's written on. The best thing would be to do a head-to-head comparison on a variety of largish data-sets. I've got some really great implementation of red-black trees in C++ and Haskell, and a 2/3 tree in Haskell. But Haskell performance is always a bit iffy, which makes me uncomfortable using it for a supposedly canonical performance comparison. If anyone can point me at solid implementations of both red/black and 2/3 trees in C, C++, or Java, I'd be glad to run the comparison.