Extending SQLite with Python

December 02, 2014 21:33 / peewee python sqlite / 1 comments

SQLite is an embedded database, which means that instead of running as a separate server process, the actual database engine resides within the application. This makes it possible for the database to call directly into the application when it would be beneficial to add some low-level, application-specific functionality. SQLite provides numerous hooks for inserting user code and callbacks, and, through virtual tables, it is even possible to construct a completely user-defined table. By extending the SQL language with Python, it is often possible to express things more elegantly than if we were to perform calculations after the fact.

In this post I'll describe how to extend SQLite with Python, adding functions and aggregates that will be callable directly from any SQL queries you execute. We'll wrap up by looking at SQLite's virtual table mechanism and seeing how to expose a SQL interface over external data sources.

Getting started

The following examples will use peewee ORM. If you're new to peewee, it's a lightweight ORM that aims to provide an expressive, Pythonic interface for executing SQL queries. In order to get started, we will use the peewee ORM sqlite extension. This extension provides a database class that exposes the SQLite hooks for registering custom functions and aggregates.

If you'd like to follow along, you can install peewee using pip:

$ pip install peewee

Then start an interactive shell and import the SQLite extension module. We will use an in-memory SQLite database:

from playhouse.sqlite_ext import *

db = SqliteExtDatabase(':memory:')

Functions

The simplest way we can extend the SQL language is by defining our own functions. A function operates on a single row of data and returns a single value. Many databases provide a standard set of functions for performing operations like transforming text to upper or lower-case, extracting substrings, generating random numbers, and manipulating dates.

Examples of SQL functions:

print db.execute_sql('SELECT random()').fetchone()
# OUTPUT: (-2523015414799391104,)

print db.execute_sql('SELECT strftime("%Y-%m-%d", "now")').fetchone()
# OUTPUT: ('2014-12-02',)

You can also integrate SQL functions with peewee models:

class TextData(Model):
    value = TextField()

    class Meta:
        database = db

TextData.create_table()

# Populate the table with some strings.
for value in ['Foo', 'Bar']:
    TextData.create(value=value)

# Use SQLite's lower-case and upper-case functions:
query = (TextData
         .select(fn.lower(TextData.value), fn.upper(TextData.value))
         .tuples())
print [row for row in query]
# OUTPUT: [('foo', 'FOO'), ('bar', 'BAR')]

Creating our own functions

SQLite provides a hook for registering our own functions. Using the peewee ORM, let's take a look at how easy it is to add a new function to SQLite. For our example, we will create a function that generates a URL-friendly representation of a title string, a process called slugifying.

import re

@db.func()
def slugify(title):
    return re.sub('[^\w]+', '-', title.lower()).strip('-')

The slugify function will take any non-alphanumeric characters, and replace them with a dash. Finally, any leading or trailing dashes are removed. Here are some example function calls:

slugify('Hello world') becomes 'hello-world'
slugify('My Awesome Post!') becomes 'my-awesome-post'

This function can be used to select, along with the title, the slugified version, which can be used in HTML templates to generate a URL-friendly link to the post. Similarly, we can use the function in the where clause to locate a post matching a given slug.

class Post(Model):
    title = CharField()
    content = TextField(default='')

    class Meta:
        database = db

Post.create_table()
Post.create(title='Extending SQLite with Python')
Post.create(title='SQLite: Small. Fast. Reliable.')

# Get a list of posts, along with the slugified title.
posts = Post.select(Post, fn.slugify(Post.title).alias('slug'))
for post in posts:
    print post.title, '-->', post.slug

# OUTPUT:
# Extending SQLite with Python --> extending-sqlite-with-python
# SQLite: Small. Fast. Reliable. --> sqlite-small-fast-reliable

# Find the post matching a given slug.
slug = 'sqlite-small-fast-reliable'
post = Post.select().where(fn.slugify(Post.title) == slug).get()

One other cool aspect of functions is that they can be composed with other functions. Imagine we want to display a list of posts by the first letter of their title. To get the list of all posts that start with the letter S we could write:

s_posts = (Post
           .select()
           .where(fn.lower(fn.substr(Post.title, 1, 1)) == 's'))

In the peewee source code you can find other examples of functions which perform date-manipulation.

Aggregate functions

Aggregate functions operate on one or more rows of data, performing some calculation and returning a single aggregate value. If you've ever queried for the count of objects in a table, then you've used an aggregate function. The databases I'm most familiar with offer a pretty standard set of built-in aggregates:

COUNT
SUM
MIN
MAX
AVG
GROUP_CONCAT (this joins multiple strings with a provided delimiter)

Example code:

class IntData(Model):
    value = IntegerField()

    class Meta:
        database = db

IntData.create_table()

# Populate the table with numbers 1 - 10.
(IntData
 .insert_many({'value': value} for value in range(1, 11))
 .execute())

# Perform some simple aggregate queries.
print IntData.select(fn.Min(IntData.value)).scalar()
# OUTPUT: 1

print IntData.select(fn.Max(IntData.value)).scalar()
# OUTPUT: 10

print IntData.select(fn.Sum(IntData.value)).scalar()
# OUTPUT: 55

print IntData.select(fn.Group_Concat(IntData.value, '--')).scalar()
# OUTPUT: 1--2--3--4--5--6--7--8--9--10

Creating our own aggregates

With SQLite, it is very easy to define custom aggregate functions. We only need to create a class that implements two methods, and then register the class with our peewee database instance.

Let's see how this is done by writing an aggregate function to calculate the mode for a list of values. If it's been a while since your 8th grade math class, the mode is the most common value in a list.

We'll begin by defining our class. To do the actual calculation we'll use a handy container object from the standard library, collections.Counter.

from collections import Counter

class Mode(object):
    def __init__(self):
        self.counter = Counter()

The aggregate interface specifies two methods, step() and finalize(). step is called once for each row in the result set, and we will use this method to update the counter with the incoming value. finalize is called once when there are no more rows and we are ready to yield a final result. This method will find the most common value in the counter and return it.

@db.aggregate()
class Mode(object):
    def __init__(self):
        self.counter = Counter()

    def step(self, *values):
        self.counter.update(values)

    def finalize(self):
        # If there are no items in the counter, we'll return None.
        if self.counter:
            return self.counter.most_common(1)[0][0]

We can now use the new aggregate function just like any other built-in aggregate. The following code example shows how to calculate the mode for a table of values. We will add some new rows to the IntData table we created in the previous example:

values = [1, 1, 2, 3, 3, 3, 4, 5, 5]
(IntData
 .insert_many({'value': value} for value in values)
 .execute())

# Call the mode function.
print IntData.select(fn.Mode(IntData.value)).scalar()
# OUTPUT: 3

If you prefer not to use a decorator, you can also call db.register_aggregate(), passing in the class.

Ideas for useful aggregates

I thought I'd share some ideas for other custom aggregations you might add:

Variance and StdDev, which exist in PostgreSQL but are not included with SQLite.
MD5Sum or SHA1Sum for calculating a checksum.
First and Last.
Bitwise and logical operators that AND or OR together multiple values.

Experimenting with Virtual Tables

By far the most interesting hook provided by SQLite is the virtual table mechanism. SQLite virtual tables allow you to expose a SQL query interface over literally any tabular data source. For instance, you could write a virtual table to expose the filesystem as a database table, query a 3rd party RESTful API, or expose your Redis instance as a SQLite table. In my post last week, the transitive closure extension provides a virtual table for querying hierarchical data, but the actual data is stored behind-the-scenes in an AVL tree.

Unfortunately, the standard-library SQLite driver does not come with support for creating virtual tables. Luckily, there is a more powerful SQLite library APSW that makes this functionality available to Python programmers. Peewee comes with an APSW database driver, so you can use APSW to create virtual tables in Python, then use Peewee models to work with the virtual tables.

Exposing Redis as a Virtual Table

Virtual tables, because they provide so much functionality, are rather complex beasts to implement. For that reason, rather than providing the entire code to the redis example, I'll try to provide an overview of the important methods with an explanation of what's going on.

When implementing a virtual table, we need to write three classes which will answer the following questions:

What does our data look like? If our data were an actual SQL table, how would we declare it?
How will we map data into rows and columns?
How will we iterate through the data?
How will we apply filters from the WHERE clause, particularly if this would lead to implementation-specific optimizations?
How do we update, insert or delete rows?

Through the Module, Table and Cursor classes, we will provide the answers to these questions.

The Module Class

At the highest-level, we have the module class, which is responsible for describing the table structure. The module class contains a Create method, which provides a table definition and returns an instance of the user-defined Table object. Here is the Create method for the redis module:

def Create(self, db, modulename, dbname, tablename, *args):
    schema = 'CREATE TABLE %s (rowid, key, value, type, parent);'
    return schema % tablename, RedisTable(tablename, self.db)

The module is the visible portion of your virtual table API, and is registered with the database connection when one is opened.

The Table Class

The next class describes the virtual table (RedisTable), which is instantiated by the module's Create method. The table class receives queries from SQLite and determines which indexes (if any) to use to efficiently generate results. If your data-source supports writes, the table also handles inserts, updates and deletes.

For the RedisTable, we are primarily interested in whether we are searching the entire list of keys, or whether we are searching for items located at a key, such as the key/value pairs of a hash, or the keys and scores of a zset. The table's BestIndex method can be simple or complex, depending on your needs. Here is the BestIndex method for the RedisTable class:

def BestIndex(self, constraints, orderbys):
    """
    Example query:

    SELECT * FROM redis_tbl
    WHERE parent = 'my-hash' AND type = 'hash';

    Since parent is column 4 and type is colum 3, the constraints will be:

    (4, apsw.SQLITE_INDEX_CONSTRAINT_EQ),
    (3, apsw.SQLITE_INDEX_CONSTRAINT_EQ)

    Ordering will be a list of 2-tuples consisting of the column index
    and boolean for descending.

    Return values are:

    * Constraints used, which for each constraint, must be either None,
      an integer (the argument number for the constraints passed into the
      Filter() method), or (int, bool) tuple.
    * Index number (default zero).
    * Index string (default None).
    * Boolean whether output will be in same order as the ordering specified.
    * Estimated cost in disk operations.
    """
    constraints_used = []
    columns = []
    for i, (column_idx, comparison) in enumerate(constraints):
        constraints_used.append(i)
        columns.append(self._columns[column_idx])

    return [
        constraints_used,  # Indices of constraints we are interested in.
        0,  # The index number, not used by us.
        ','.join(columns),  # The index name, a list of filter columns.
        False,  # Whether the results are ordered.
        1000 if 'parent' in columns else 10000]

If you like, you can also leave off the BestIndex method and let SQLite perform the filtering manually (using a table scan). For large amounts of data, or for data which may change depending on the filters (as is the case with our Redis virtual table), the BestIndex method provides a way to control what data is passed to SQLite.

The Cursor Class

The final class is the cursor, which we will use to actually extract data from Redis and apply the appropriate filters. The cursor implements a Filter method which receives the data specified by the return value of BestIndex. For our Redis example, we will use the following logic:

If no parent key is specified, our base data-set will be all keys in the database.
If a parent key is specified, depending on the type of data stored in the key, generate appropriate tabular data.

The important thing to note is that we do not need to do all the processing in our virtual table -- SQLite will actually do most of the work for us, so long as we give it some tabular data. Here is the Filter method, and the get_data_for_key helper:

def Filter(self, indexnum, indexname, constraintargs):
    """
    This method is always called first to initialize an iteration to the
    first row of the table. The arguments come from the BestIndex() method
    in the table object with constraintargs being a tuple of the
    constraints you requested.
    """
    columns = indexname.split(',')
    column_to_value = dict(zip(columns, constraintargs))
    if 'parent' in column_to_value:
        initial_key = column_to_value['parent']
        data = self.get_data_for_key(initial_key)
    else:
        data = []
        for i, key in enumerate(self.db.keys()):
            key_type = self.db.type(key)
            if key_type == 'string':
                value = self.db.get(key)
            else:
                value = None
            data.append((i, key, value, None, key_type))

    self.data = data
    self.index = 0
    self.nrows = len(data)

def get_data_for_key(self, key):
    # 'rowid', 'key', 'value', 'type', 'parent'
    key_type = self.db.type(key)
    if key_type == 'list':
        return [
            (i, i, value, 'list', key)
            for i, value in enumerate(self.db.lrange(key, 0, -1))]
    elif key_type == 'set':
        return [
            (i, value, None, 'set', key)
            for i, value in enumerate(self.db.smembers(key))]
    elif key_type == 'zset':
        all_members = self.db.zrange(key, 0, -1, withscores=True)
        return [
            (i, value, score, 'zset', key)
            for i, (value, score) in enumerate(all_members)]
    elif key_type == 'hash':
        return [
            (i, k, v, 'hash', key)
            for i, (k, v) in enumerate(self.db.hgetall(key).iteritems())]
    elif key_type == 'none':
        return []
    else:
        return [(1, key, self.db.get(key), 'string', key)]

This is pretty much all of the code! The rest is just book-keeping and stubbed-out methods. Of course, the sky's the limit with virtual tables. I'm only just beginning to learn how to work with them, but they seem like a great way to create some truly ridiculous hacks.

For more information on virtual tables, here are some links you might find useful:

Thanks for reading

Thanks for taking the time to read this post, I hope you found it interesting. Feel free to leave a comment below if you have any comments or questions.

Comments (1)

yegle | dec 05 2014, at 01:30pm

I wonder if there's any other ORMs out there that support these SQLite advanced features?

Commenting has been closed.