Querying Tree Structures in SQLite using Python and the Transitive Closure Extension

photos/fall-foliage.jpg

I recently read a good write-up on tree structures in PostgreSQL. Hierarchical data is notoriously tricky to model in a relational database, and a variety of techniques have grown out of developers' attempts to optimize for certain types of queries.

In his post, Graeme describes several approaches to modeling trees, including:

In the comments, some users pointed out that the ltree extension could also be used to efficiently store and query materialized paths. LTrees support two powerful query languages (lquery and ltxtquery) for pattern-matching LTree labels and performing full-text searches on labels.

One technique that was not discussed in Graeme's post was the use of closure tables. A closure table is a many-to-many junction table storing all relationships between nodes in a tree. It is related to the adjacency model, in that each database row still stores a reference to its parent row. The closure table gets its name from the additional table, which stores each combination of ancestor/child nodes.

Here's a simple category hierarchy you might see on an e-commerce site. Like the adjacency model, each category stores a reference to its immediate parent category:

The closure table stores references between all ancestors and their descendants, and the relative depth of the relationship.

To further illustrate these relationships, let's pretend our online bookstore database has the following categories:

Here is an outline of the data stored in the closure table:

Ancestor Descendant Depth Description
110Self-reference
220Self-reference
330Self-reference
440Self-reference
550Self-reference
121Books - Fiction
151Books - Nonfiction
132Books - Fiction - Sci-Fi
142Books - Fiction - Westerns
231Fiction - Sci-Fi
241Fiction - Westerns

As you can see, the full ancestry of every category is included in the table, including a denormalized depth value. This can lead to some very efficient queries at the cost of extra storage (O(n^2) worst).

While there are numerous benefits to this approach, it is more complex than a simple materialized path or adjacency model. Luckily, for SQLite users, there is a C extension which will manage the maintenance of the closure table for you!

Transitive Closure Extension

Buried deep within the SQLite source tree is an extension module named closure.c. I was browsing the SQLite source a while back and decided to give this module a try, and was pleasantly surprised by how seamlessly it worked.

Here's how it works. You create your adjacency-model table as usual, then create a virtual table using the transitive_closure extension. The virtual table needs to be initialized with the following parameters:

Once the extension is initialized, you can perform queries using the closure table without having to worry about updating or otherwise managing the denormalized data. The transitive_closure extension maintains an AVL tree behind-the-scenes.

Building the Extension

In order to get started, we will need to retrieve and compile the extension. The extension can be found online in SQLite's version control repo. To just retrieve the relevant C file, you can clone this private Gist:

$ git clone https://gist.github.com/coleifer/7f3593c5c2a645913b92 closure

Once you have the source file closure.c, compile a shared library:

$ gcc -g -fPIC -shared closure.c -o closure.so

Congratulations! Now we're ready to start using the extension.

Basic queries

In the same directory as the shared library, fire up the SQLite shell and load the extension using the .load command:

$ sqlite3
SQLite version 3.8.7.1 2014-10-29 13:59:56
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.

sqlite> .load ./closure

We'll create the categories table and data from the bookstore example presented earlier in the post. The following statements create a normal adjacency-model table, and a transitive closure virtual table:

CREATE TABLE category (
  id INTEGER NOT NULL PRIMARY KEY,
  name VARCHAR(255),
  parent_id INTEGER,
  FOREIGN KEY (parent_id) REFERENCES category (id));

CREATE VIRTUAL TABLE category_closure USING transitive_closure (
  tablename="category",
  idcolumn="id",
  parentcolumn="parent_id");

Next we will insert the data into the category table. There is nothing special about these statements, they are normal INSERT queries:

INSERT INTO category (name, parent_id) VALUES ('Books', NULL);
INSERT INTO category (name, parent_id) VALUES ('Fiction', 1);
INSERT INTO category (name, parent_id) VALUES ('Sci-fi', 2);
INSERT INTO category (name, parent_id) VALUES ('Western', 2);
INSERT INTO category (name, parent_id) VALUES ('Non-fiction', 1);

Note that we are not inserting any values into the category_closure table. The category_closure table will automatically populate based on the values stored in the category table.

No additional steps are necessary in order to start using the closure table. We can simply query category_closure, specifying a combination of the root or depth we are interested in.

To query the direct descendants of a given category, we will query the closure table for nodes rooted at the given category with a relative depth of 1:

sqlite> SELECT name FROM category WHERE id IN (
   ...>   SELECT cc.id
   ...>   FROM category_closure AS cc
   ...>   WHERE cc.root = 1 AND cc.depth = 1);

name
-----------
Fiction
Non-fiction

To query all descendants of a given category, simply leave off the depth filter:

sqlite> SELECT name FROM category WHERE id IN (
   ...>   SELECT cc.id
   ...>   FROM category_closure AS cc
   ...>   WHERE cc.root = 1);

name
-----------
Books
Fiction
Sci-fi
Western
Non-fiction

An interesting aspect of closure tables is that self-references are stored with a depth of zero. That is why we see the Books category itself included as a descendant.

Using Transitive Closure with Peewee ORM

In the latest release of peewee ORM (2.4.3), I've added a factory function for creating virtual models suitable for working with the closure table extension. I've also added documentation for working with closure tables in peewee.

If you're new to peewee, it's a lightweight ORM that aims to provide an expressive, Pythonic interface for executing SQL queries. Here is how you would define the category relationships using peewee and the transitive-closure extension:

from peewee import *
from playhouse.sqlite_ext import *

db = SqliteExtDatabase('categories.db')
db.load_extension('/path/to/closure')  # Note we do not include the ".so".

class Category(Model):
    name = CharField()
    parent = ForeignKeyField('self', null=True, related_name='children')

    class Meta:
        database = db

CategoryClosure = ClosureTable(Category)

# Create the tables if they do not exist already.
Category.create_table(True)
CategoryClosure.create_table(True)

To populate the database with our bookstore data, we can create a handful of Category model instances:

books = Category.create(name='Books')
fiction = Category.create(name='Fiction', parent=books)
Category.create(name='Sci-fi', parent=fiction)
Category.create(name='Westerns', parent=fiction)
Category.create(name='Non-fiction', parent=books)

Let's run a query on the sample data using our model classes. The process is very similar to executing queries with the SQLite shell. The following query retrieves all the nodes which are descendants of the Books category at any depth:

subquery = (CategoryClosure
            .select(CategoryClosure.id)
            .where(CategoryClosure.root == books))
for descendant in Category.select().where(Category.id << subquery):
    print descendant.name

Common queries

In this section I'll show some common queries and how they can be expressed using the closure table. In some cases these queries can be implemented without the use of the closure table, so where that's the case I'll show both options. We'll assume that the node we're interested in is named source_node.

Get all descendant nodes

Given a node, retrieve all nodes below it in the tree at any depth.

descendants = (Category
               .select()
               .join(
                   CategoryClosure,
                   on=(Category.id == CategoryClosure.id))
               .where(CategoryClosure.root == source_node))

# Using a subquery instead. "<<" translates to "IN".
subquery = (CategoryClosure
            .select(CategoryClosure.id)
            .where(CategoryClosure.root == source_node))
all_descendants = Category.select().where(Category.id << subquery)

# Using the helper method.
all_descendants = CategoryClosure.descendants(
    source_node, include_node=True)

To exclude source_node in the list of child nodes, add a depth > 0 clause to the where clause or subquery:

descendants = (Category
               .select()
               .join(
                   CategoryClosure,
                   on=(Category.id == CategoryClosure.id))
               .where(
                   (CategoryClosure.root == source_node) &
                   (CategoryClosure.depth > 0)))

# Using helper method.
descendants = CategoryClosure.descendants(source_node, include_node=False)

Get all direct descendant nodes

Given a node, retrieve its child nodes.

# We can use just the Category table in this case.
direct_descendants = Category.select().where(Category.parent == source_node)

# We can join on the closure table.
direct_descendants = (Category
                      .select()
                      .join(
                          CategoryClosure,
                          on=(Category.id == CategoryClosure.id))
                      .where(
                          (CategoryClosure.root == source_node) &
                          (CategoryClosure.depth == 1)))

# We can use a subquery.
subquery = (CategoryClosure
            .select(CategoryClosure.id)
            .where(
                (CategoryClosure.root == source_node) &
                (CategoryClosure.depth == 1)))
direct_descendants = Category.select().where(Category.id << subquery)

# Using helper method.
direct_descendants = CategoryClosure.descendants(source_node, depth=1)

Note that this technique can be used to retrieve descendant nodes at any depth. Thus you could specify depth == 2 to retrieve all grandchildren.

Get all sibling nodes

Given a node, retrieve all nodes at the same depth, with the same parent.

# We can use just the Category table.
siblings = Category.select().where(Category.parent == source_node.parent)

# Or use the closure table.
siblings = (Category
            .select()
            .join(CategoryClosure, on=(Category.id == CategoryClosure.id))
            .where(
                (CategoryClosure.root == source_node.parent) &
                (CategoryClosure.depth == 1)))

# Using helper method.
siblings = CategoryClosure.siblings(source_node)

Get all ancestors

Given a node, retrieve all nodes from which it is descended.

# Using a JOIN.
ancestors = (Category
             .select()
             .join(
                 CategoryClosure,
                 on=(Category.id == CategoryClosure.root))
             .where(CategoryClosure.id == source_node))

# Using multiple tables in the FROM clause.
ancestors = (Category
             .select()
             .from_(Category, CategoryClosure)
             .where(
                 (Category.id == CategoryClosure.root) &
                 (CategoryClosure.id == source_node)))

# Using helper method.
ancestors = CategoryClosure.ancestors(source_node)

IPython Notebook

If you'd like to explore this code in more detail, I've put together an IPython notebook containing these examples and a bit more code. You can view the notebook here:

http://nbviewer.ipython.org/gist/coleifer/692c69724f08dc1dc8cc

I also included some basic benchmarks of the extension compared with a simple materialized path model. The benchmarks test the time it takes to build a tree using both the closure table and an indexed materialized path. Then, the benchmark will execute a series of SELECT queries at various depths in the tree to retrieve direct descendants and all descendants. I tried various tree structures, such as very deep trees, broad trees, and uniform trees. In my testing, the closure table performed better in every case, but these benchmarks are far from scientific -- rather they're there if you'd like to poke at the code and test it yourself.

Quick note on recursion

I thought I would quickly mention that it is also possible to use recursion and recursive CTEs to query hierarchical data-structures. Here is an example of how you might retrieve the tree structure, including the depth:

WITH RECURSIVE cte_categories (id, name, parent_id, depth) AS (
    SELECT id, name, parent_id, 1
    FROM category
    WHERE parent_id IS NULL -- get root nodes
    UNION ALL
    SELECT c.id, c.name, c.parent_id, r.depth + 1
    FROM category AS c
    INNER JOIN cte_categories AS r ON (c.parent_id = r.id)
)
SELECT name, depth, parent_id
FROM cte_categories ORDER BY depth, name;

If you're interested in learning more about recursive SQL queries, the SQLite documentation provides some good examples.

Thanks for reading

Thanks for taking the time to read this post, I hope you found it interesting. The transitive-closure extension is a pretty cool library and could definitely be useful if you're doing some serious hierarchical modeling with SQLite. If you have any questions or comments, please feel free to leave a comment or contact me.

Here are some links which you may find helpful:

Here are some blog posts on related topics:

photos/tree-squirrel.jpg

Do you see the little friend perched on the branch?

Comments (0)


Commenting has been closed, but please feel free to contact me