Going Fast with SQLite and Python


In this post I'd like to share with you some techniques for effectively working with SQLite using Python. SQLite is a capable library, providing an in-process relational database for efficient storage of small-to-medium-sized data sets. It supports most of the common features of SQL with few exceptions. Best of all, most Python users do not need to install anything to get started working with SQLite, as the standard library in most distributions ships with the sqlite3 module.

The documentation for the standard library SQLite module (pysqlite) is quite thorough, so I will not try to rehash what is already covered in detail (and kept up-to-date, unlike this blog...). Instead I'll discuss some ways to achieve better performance from SQLite. I'll also share some tricks you may not have been aware of. I hope you find this information helpful.

Transactions, Concurrency, and Autocommit

pysqlite is notorious for implementing weird transactional semantics. By default, pysqlite will open a transaction when you issue your first write query (it does this by examining every query you execute). The transaction will be automatically committed when you either call Connection.commit() or execute any other query that is not a SELECT, INSERT, UPDATE, or DELETE (for example CREATE TABLE or PRAGMA).
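With recent versions of the standard-library driver (Python 3.6 and later, where the implicit commit before DDL statements was removed), this implicit transaction handling can be observed through the Connection.in_transaction attribute. A minimal sketch:

```python
import sqlite3

states = []
conn = sqlite3.connect(':memory:')  # Default isolation_level: implicit transactions.
conn.execute('create table kv (k, v)')
states.append(conn.in_transaction)  # DDL alone does not open a transaction.
conn.execute("insert into kv values ('a', 1)")
states.append(conn.in_transaction)  # The first write query opened one implicitly.
conn.commit()
states.append(conn.in_transaction)  # commit() returns the connection to autocommit.
```

After this runs, `states` is `[False, True, False]`: the transaction was opened by the INSERT, not the CREATE TABLE, and held until the explicit commit.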

This makes it easy to issue a write, which acquires the global SQLite write lock, and then unnecessarily hold that lock while you issue SELECT queries that have no need of it. Because SQLite allows only a single writer per database, it's typically in your best interest to keep write transactions as short as possible. The semantics of pysqlite can give a false impression that SQLite is completely unsuitable for any application requiring concurrent database access, despite the fact that writes typically complete very quickly.

There are a couple ways to address the problematic interaction of the global write lock and the bad behavior of pysqlite. The most general would be to use the write-ahead-logging (WAL) journal_mode option. WAL-mode allows multiple readers to co-exist with a single writer. Ordinarily, when one connection is holding the write lock, no other connection can write or read until the lock is released. WAL-mode relaxes these restrictions by allowing readers to operate while another connection writes to the database.

Using pysqlite in autocommit mode and explicitly managing transactional state in your application can also provide better performance. In this way you can ensure that you are not holding a write-lock any longer than absolutely necessary. Unless you explicitly issue a BEGIN statement, opening a transaction, all statements will be executed independently, in their own transactions. This means that the write lock is held for the minimum amount of time necessary. As I mentioned earlier, writes occur very quickly, so it is possible for many connections to write to the database in a performant manner even though the writes occur one-at-a-time.

Another, more drastic, approach that works for some applications is to maintain a dedicated writer thread and funnel all writes through a single connection. As with the other suggestions, latency can become an issue under heavy write pressure: the queue may grow large, causing lags when an application thread needs to confirm a write before continuing, so this approach suits only some workloads.
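The writer-thread pattern can be sketched as follows. This is a minimal illustration, not a production implementation; all names here (write_queue, writer_loop, write) are my own:

```python
import queue
import sqlite3
import threading

write_queue = queue.Queue()
_done = object()  # Sentinel used to shut the writer thread down.

def writer_loop(db_path):
    # The only connection that ever writes; it lives on the writer thread.
    conn = sqlite3.connect(db_path, isolation_level=None)
    while True:
        item = write_queue.get()
        if item is _done:
            break
        sql, params, result = item
        try:
            conn.execute(sql, params)
            result.put(True)  # Signal the calling thread that the write landed.
        except Exception as exc:
            result.put(exc)   # Hand the error back to the caller.
    conn.close()

def write(sql, params=()):
    # Application threads call this; it blocks until the write is applied.
    result = queue.Queue()
    write_queue.put((sql, params, result))
    outcome = result.get()
    if isinstance(outcome, Exception):
        raise outcome
    return outcome
```

Blocking on the per-write result queue is what lets callers confirm a write before continuing; dropping that wait trades durability guarantees for lower latency.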

Example code:

# Open database in autocommit mode by setting isolation_level to None.
conn = sqlite3.connect('app.db', isolation_level=None)

# Set journal mode to WAL.
conn.execute('pragma journal_mode=wal')

Here is an example illustrating the difference between the default journal mode (delete) and WAL mode:

### Default behavior.

import sqlite3

writer = sqlite3.connect('/tmp/scratch.db', isolation_level=None)
reader = sqlite3.connect('/tmp/scratch.db', isolation_level=None)

writer.execute('create table foo (data)')
reader.execute('select * from foo;')  # No problem.

writer.execute('begin exclusive;')
reader.execute('select * from foo;')  # OperationalError: database is locked

### WAL-mode.

writer = sqlite3.connect('/tmp/wal.db', isolation_level=None)
writer.execute('pragma journal_mode=wal;')

reader = sqlite3.connect('/tmp/wal.db', isolation_level=None)
reader.execute('pragma journal_mode=wal;')

writer.execute('create table foo (data)')
reader.execute('select * from foo')  # No problem.

writer.execute('begin exclusive')  # Acquire write lock.
reader.execute('select * from foo')  # Still no problem!

You can implement a simple context manager for handling transactions:

from contextlib import contextmanager

@contextmanager
def transaction(conn):
    conn.execute('begin')  # We must issue "BEGIN" explicitly in auto-commit mode.
    try:
        yield conn  # Yield control back to the caller.
    except BaseException:
        conn.rollback()  # Roll back all changes if an exception occurs.
        raise
    else:
        conn.commit()  # Commit if the block completed without error.

My experience has been that by ensuring that writes are committed quickly, SQLite can be a great fit for web-based and multi-threaded applications. The WAL journal-mode and autocommit-mode also help ensure my assumptions about locks and transactional state are correct.

User-defined functions

SQLite runs embedded in-process alongside your application, allowing you to easily extend it with your own Python code. SQLite provides quite a few hooks (scalar functions, aggregates, collations), a reasonable subset of which are implemented by the standard library database driver, and the driver's documentation provides thorough example code.

The most powerful hook of all is not available through the standard library driver, however: the ability to programmatically define complete tables (which are then queried using ordinary SQL). Using the virtual table APIs, it is possible to create completely dynamic tables. Or, put another way, to create user-defined functions that return tabular data.

The apsw SQLite driver provides hooks for implementing virtual tables in Python, but the APIs are close correlates to their C equivalents and can be tricky to work with for simple use-cases. A simpler approach is available through the vtfunc library (and will be included in the 3.x release of peewee).

Let's look at a simple example using vtfunc, in which we'll create a user-defined function that returns tabular results (which can be joined, filtered, ordered, etc just like any normal query).

Generating series of integers

As a simple example, let's re-implement the SQLite generate_series extension, which behaves like Python's range builtin. We begin by defining a class that implements two methods (initialize and iterate) and defines two required attributes (params, columns):

import sqlite3
from vtfunc import TableFunction

class Series(TableFunction):
    params = ['start', 'stop', 'step']  # These are the arguments to our function.
    columns = ['idx', 'output']  # These are the columns our output rows will have.
    name = 'series'  # Optional -- if unspecified, uses lower-case class name.

    def initialize(self, start=0, stop=None, step=1):
        ...

    def iterate(self, idx):
        ...

When our function is first called, the arguments specified by the caller are passed to the initialize function. Arguments may have default values. Then, when the database needs a row from our table (to populate a result set), the iterate() method will be called, returning a row of data or raising StopIteration if no more data is available:

class Series(TableFunction):
    params = ['start', 'stop', 'step']  # These are the arguments to our function.
    columns = ['idx', 'output']  # These are the columns our output rows will have.
    name = 'series'  # Optional -- if unspecified, uses lower-case class name.

    def initialize(self, start=0, stop=None, step=1):
        self.start = self.curr = start
        self.stop = stop if stop is not None else float('inf')
        self.step = step

    def iterate(self, idx):
        if (self.step > 0 and self.curr >= self.stop) or \
           (self.step < 0 and self.curr <= self.stop):
            raise StopIteration
        ret = self.curr
        self.curr += self.step
        return (idx, ret)

To start using our new series function, we'll register it with a connection and then query it:

conn = sqlite3.connect(':memory:')
Series.register(conn)  # Register the table function with this connection.

for row in conn.execute('SELECT * FROM series(0, 10, 2)'):
    print(row)

# (0, 0)
# (1, 2)
# (2, 4)
# (3, 6)
# (4, 8)

print(conn.execute('SELECT * FROM series(0, NULL, 20) LIMIT 4').fetchall())
# [(0, 0), (1, 20), (2, 40), (3, 60)]

Using virtual tables, it's even possible to expose a SQL interface over a completely separate data source like Redis or CSV data.

More examples are available in the vtfunc repository.

Useful PRAGMAs

SQLite requires zero configuration to get up and running, but in order to go fast there are a few settings you may want to adjust. SQLite is configured by executing special queries of the form PRAGMA <setting> = <value>. The ones I routinely find myself modifying are journal_mode, cache_size, synchronous, and mmap_size.
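A per-connection setup sketch along those lines; the values below are illustrative starting points, not universal recommendations:

```python
import os
import sqlite3
import tempfile

def configure(conn):
    conn.execute('pragma journal_mode = wal')    # Persisted in the database file.
    conn.execute('pragma cache_size = -64000')   # Negative = size in KiB (64MB here).
    conn.execute('pragma synchronous = normal')  # Fewer fsyncs; a safe pairing with WAL.
    conn.execute('pragma mmap_size = 268435456') # Memory-map up to 256MB of the file.
    return conn

db_path = os.path.join(tempfile.mkdtemp(), 'app.db')
conn = configure(sqlite3.connect(db_path))
```

Calling a helper like configure() on every new connection keeps the non-persistent settings consistent across your application.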

PRAGMA settings are generally not persistent, so they should be executed each time a new connection is opened. The notable exception is journal_mode: once the database has been put into WAL mode, that choice is stored in the database file itself and applies to subsequent connections.

Compilation Flags

Many distributions ship with an old-ish version of SQLite that does not include some of the newer extension modules, so I typically compile SQLite myself with the extensions I need enabled at compile-time.
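A representative feature-flag set looks like the following; the exact selection is a matter of taste, but each is a documented SQLITE_ compile-time option:

```shell
# Illustrative feature flags; adjust to the extensions you actually need.
export CFLAGS="$CFLAGS \
  -DSQLITE_ENABLE_FTS5 \
  -DSQLITE_ENABLE_JSON1 \
  -DSQLITE_ENABLE_RTREE \
  -DSQLITE_ENABLE_STAT4"
```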

For debugging and profiling, a few additional flags may be useful, at some cost to build time and runtime performance.
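For example (again, an illustrative selection of documented compile-time options):

```shell
# Debugging / profiling flags; expect a slower build and runtime.
export CFLAGS="$CFLAGS \
  -DSQLITE_DEBUG \
  -DSQLITE_ENABLE_STMT_SCANSTATUS \
  -DSQLITE_ENABLE_EXPLAIN_COMMENTS"
```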

To compile the latest version of SQLite, you can run something like the following:

fossil clone http://www.sqlite.org/cgi/src sqlite.fossil
mkdir sqlite-src
cd sqlite-src/
fossil open ../sqlite.fossil

export CFLAGS="$CFLAGS -fPIC -O2"
export PREFIX="$(pwd)"
LIBS="-lm" ./configure --enable-static --enable-shared --prefix="$PREFIX"
make && make install

You can create self-contained pysqlite libraries by statically-linking against your custom build of SQLite:

git clone https://github.com/ghaering/pysqlite
cd pysqlite/
cp $PREFIX/sqlite3.c .  # Copy sqlite3.c into checkout.
echo -e "library_dirs=$PREFIX/lib" >> setup.cfg
echo -e "include_dirs=$PREFIX/include" >> setup.cfg
LIBS="-lm" python setup.py build_static

You can use the statically-linked pysqlite just as you would normally use the standard library's sqlite3 module:

from pysqlite2 import dbapi2 as sqlite3

conn = sqlite3.connect(':memory:')
print(conn.execute('PRAGMA compile_options').fetchall())
# [(u'DEFAULT_CACHE_SIZE=-8000',),
#  (u'ENABLE_FTS3',),
#  etc...

