gavinhewitt
4/13/2012 - 3:42 PM

Persistent tree structure using closure table in MySQL

Persistent tree structure using closure table in MySQL

Using closure tables to manage hierarchical relations in MySQL

Create DB tables

Create a table to represent tree nodes.

CREATE TABLE `tree_node` (
    `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
    `data_body` text,
    `node_deleted` datetime DEFAULT NULL,
    PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

And create a closure table to manage the tree's structure.

CREATE TABLE `tree_hierarchy` (
    `ancestor` int(10) unsigned NOT NULL,
    `descendant` int(10) unsigned NOT NULL,
    PRIMARY KEY (`ancestor`,`descendant`),
    KEY `descendant` (`descendant`),
    CONSTRAINT `tree_hierarchy_ibfk_1` FOREIGN KEY (`ancestor`) REFERENCES `tree_node` (`id`),
    CONSTRAINT `tree_hierarchy_ibfk_2` FOREIGN KEY (`descendant`) REFERENCES `tree_node` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

Multiple trees can be stored in these two tables, effectively creating a relational persistence container for a forest of trees. The closure table behaves as loosely coupled overlay, and additional tables could be used to represent alternative structure pertaining to the data.

Add/remove leaf nodes

To INSERT a leaf node into the tree we can make two DB queries: one into each table. We get a node ID for the newly created node and then use it as the <node_id>, which is the child of the <parent_node_id>.

INSERT INTO tree_node (data_body) VALUES ("Demo node data body.");

INSERT INTO tree_hierarchy (ancestor, descendant) SELECT ancestor, <node_id> FROM tree_hierarchy WHERE descendant=<parent_node_id> UNION ALL SELECT <node_id>, <node_id>;

And to delete a leaf node:

UPDATE tree_node SET node_deleted=NOW() WHERE id=<node_id>;

DELETE FROM tree_hierarchy WHERE descendant=<node_id>;

Get the full tree

We can query the tree_node table itself to perform a key-value look up an individual tree node or we can do a more complex query like the following to get an entire tree of elements.

SELECT
    tree_node.id,
    tree_node.node_deleted,
    tree_node.data_body,
    tree_hierarchy.ancestor AS parent_id
FROM tree_node
    JOIN tree_hierarchy ON tree_node.id=tree_hierarchy.ancestor
WHERE tree_hierarchy.descendant=<node_id>
    AND tree_node.node_deleted IS NULL
UNION
SELECT
    tree_node.id,
    tree_node.node_deleted,
    tree_node.data_body,
    tree_hierarchy.ancestor AS parent_id
FROM tree_node
    JOIN tree_hierarchy ON tree_node.id=tree_hierarchy.descendant
WHERE tree_hierarchy.ancestor=<node_id>
    AND tree_node.node_deleted IS NULL;

We don't need to use recursive logic to get a representation of the tree from the database. The SELECT query requires a non-trivial (but reasonably small) amount of work to return all nodes in the tree. It's then the program logic's responsibility to turn the results into a data structure representation that the application logic can use.

Note an interesting quirk: this technique enables getting an entire tree by specifying any node ID within the tree.

TODO: Other common tree operations

TODO: Take this further to explore the limitations or potential variations on use of this technique. Make note of how this solution scales or what it could be paired with for scaling (i.e. gizzard, etc.?).

What we have is essentially a key-value table paired with a mapping table that can be queried. I'm interested in seeing how this could be represented using distributed data stores, especially using data stores that emphasize key-value data and pairing various data store technologies to manage different types of data (i.e. store nodes in something like riak while storing the ancestry details as secondary indexes for the data stored in riak).

#references Slides 40-68 in http://www.slideshare.net/billkarwin/models-for-hierarchical-data slide deck. This technique is commonly used for modeling comment systems that support nested comments.