SQLite Table-Valued Functions with Python

One of the benefits of running an embedded database like SQLite is that you can configure SQLite to call into your application's code. SQLite provides APIs that allow you to create your own scalar functions, aggregate functions, collations, and even your own virtual tables. In this post I'll describe how I used the virtual table APIs to expose a nice API for creating table-valued (or, multi-value) functions in Python. The project is called sqlite-vtfunc and is hosted on GitHub. If you use Peewee, an equivalent implementation is included in the Peewee SQLite extensions.

First, a little terminology. Most people who have delved a bit into SQLite are acquainted with the fact that you can create scalar functions and then call them from your SQL queries. For example, you might implement an md5() function in your application and register it with SQLite, so that you can call md5() in your SQL queries. The only limitation is that user-defined functions can only return a single value.

SQLite's table-valued functions are functions that can return multiple rows of tabular data. Behind-the-scenes, these function calls are actually just queries on a virtual table. The SQLite generate_series extension is a good example:

sqlite> SELECT * FROM generate_series(1, 5);
1
2
3
4
5

If you're curious about the implementation of the generate_series virtual table, you can check out the C source. It's quite long, with the real meat of the implementation in just two functions, seriesFilter and seriesBestIndex.

My goal was to provide a wrapper around the virtual table API so that I could skip all the boilerplate and implement table-valued functions in Python.

Here is a generate_series implementation I created using the Python sqlite-vtfunc library:

from vtfunc import TableFunction

class GenerateSeries(TableFunction):
    params = ['start', 'stop', 'step']
    columns = ['output']
    name = 'series'

    def initialize(self, start=0, stop=None, step=1):
        self.start = start
        self.stop = stop or float('inf')
        self.step = step
        self.curr = self.start

    def iterate(self, idx):
        if self.curr > self.stop:
            raise StopIteration

        ret = self.curr
        self.curr += self.step
        return (ret,)

To use the Python version, you only need to register the module with your connection (the equivalent of sqlite3_create_module) and then query it:

import sqlite3

conn = sqlite3.connect(':memory:')

series = GenerateSeries()
series.register(conn)

cursor = conn.execute('SELECT * FROM series(0, 6, 2)')
print cursor.fetchall()  # Prints [(0,), (2,), (4,), (6,)]

Implementation

sqlite-vtfunc is implemented in Cython since pysqlite does not provide any hooks into the virtual table APIs. The virtual table mechanism of SQLite is an incredibly powerful tool, and is the basis for SQLite's full-text search engine, among other things. To support this level of complexity, the API is suitably dense.

Table-valued functions, on the other hand, really only need to know two things:

For that reason, when defining a table-valued function with sqlite-vtfunc, you only need to define an initialize() and an iterate() method.

Creating a new table-valued function

In this section I'll walk through the creation of a table-valued function that returns all matching subgroups from a search via regular-expression. For example, suppose we want to filter out all the email addresses from a block of text.

To get started, we need to define our function's parameters and it's result columns. For the regex search, we'll have two parameters: regex and search_string. The output will be a single column containing each matching sub-group. So we define the shell of our function like so:

import re  # Python regex module, useful later on.
from vtfunc import TableFunction

class RegexSearch(TableFunction):
    params = ['regex', 'search_string']
    columns = ['match']
    name = 'regex_search'

Now we just need to fill in the two methods, initialize() and iterate(). The initialize() method is called once, when our function is invoked, and it receives the parameters passed in by the user. We'll use the initialize() method to set up an iterator that will successively yield matching subgroups from the search string:

class RegexSearch(TableFunction):
    params = ['regex', 'search_string']
    columns = ['match']
    name = 'regex_search'

    def initialize(self, regex=None, search_string=None):
        self._iter = re.finditer(regex, search_string)

The iterate() method accepts a single parameter indicating what row is being asked for. This parameter does not matter for our particular implementation so we'll ignore it. The iterate() function will just advance the re.finditer by one and return the match as a tuple. When there are no more matches, a StopIteration() exception will be raised, which is the signal to SQLite that there are no more rows:

class RegexSearch(TableFunction):
    params = ['regex', 'search_string']
    columns = ['match']
    name = 'regex_search'

    def initialize(self, regex=None, search_string=None):
        self._iter = re.finditer(regex, search_string)

    def iterate(self, idx):
        return (next(self._iter).group(0),)

And that's all there is to our module! Let's use it to try and extract some email addresses from a chunk of text:

>>> import sqlite3
>>> conn = sqlite3.connect(':memory:')
>>> regex_search = RegexSearch()
>>> regex_search.register(conn)
>>> regex = r'[\w]+@[\w]+\.[\w]{2,3}'  # Hacky email regex.
>>> text = "Hello charlie@example.com, this is foo@baz.com. I'm writing to let you know that nugget@baze.com is displeased with your blog."
>>> curs = conn.execute('SELECT * FROM regex_search(?, ?)', (regex, text))
>>> curs.fetchall()
[(u'charlie@example.com',), (u'foo@baz.com',), (u'nugget@baze.com',)]

A suggestion for SQLite

It would be fantastic if SQLite provided an official C interface for creating minimal table-valued functions like these. I understand the reasons for not doing so, since there will always be folks like me who will write API wrappers, but having an official API would let library developers like @rogerbinns and @ghaering create wrappers as part of the standard SQLite tooling.

That's it and that's all

Thanks for taking the time to read this post, I hope you found it interesting. The sqlite-vtfunc module is brand new and probably has bugs, and the APIs may change. It was a fun challenge to work on and I'm excited about the possibilities this opens up for me as a Python developer who likes using SQLite.

One possibility that's enticing would be to implement a json_path table-valued function that works with SQLite's new json1 extension.

Any feedback or suggestions for improvement are appreciated. Thanks again for reading!

Comments (0)


Commenting has been closed.