Going Fast with SQLite and Python

November 01, 2017 15:30 / python sqlite / 0 comments

In this post I'd like to share with you some techniques for effectively working with SQLite using Python. SQLite is a capable library, providing an in-process relational database for efficient storage of small-to-medium-sized data sets. It supports most of the common features of SQL with few exceptions. Best of all, most Python users do not need to install anything to get started working with SQLite, as the standard library in most distributions ships with the sqlite3 module.

The documentation for the standard library SQLite module (pysqlite) is quite thorough, so I will not try to rehash what is already covered in detail (and kept up-to-date, unlike this blog...). Instead I'll discuss some ways to achieve better performance from SQLite. I'll also share some tricks you may not have been aware of. I hope you find this information helpful.

Transactions, Concurrency, and Autocommit

pysqlite is notorious for implementing weird transactional semantics. By default, pysqlite will open a transaction when you issue your first write query (it does this by examining every query you execute). The transaction will be automatically committed when you either call Connection.commit() or execute any other query that is not a SELECT, INSERT, UPDATE, or DELETE (for example CREATE TABLE or PRAGMA).

This makes it very easy to issue a write, which acquires the global SQLite write lock, and then unnecessarily hold that lock while you issue SELECT queries, etc, which have no need of the write lock. Because SQLite only allows a single writer per database, it's typically in your best interests to keep write transactions as short as possible. The semantics of pysqlite can give a false impression that SQLite is completely unsuitable for any application that requires concurrent database access, despite the fact that writes typically occur very quickly.

There are a couple ways to address the problematic interaction of the global write lock and the bad behavior of pysqlite. The most general would be to use the write-ahead-logging (WAL) journal_mode option. WAL-mode allows multiple readers to co-exist with a single writer. Ordinarily, when one connection is holding the write lock, no other connection can write or read until the lock is released. WAL-mode relaxes these restrictions by allowing readers to operate while another connection writes to the database.

Using pysqlite in autocommit mode and explicitly managing transactional state in your application can also provide better performance. In this way you can ensure that you are not holding a write-lock any longer than absolutely necessary. Unless you explicitly issue a BEGIN statement, opening a transaction, all statements will be executed independently, in their own transactions. This means that the write lock is held for the minimum amount of time necessary. As I mentioned earlier, writes occur very quickly, so it is possible for many connections to write to the database in a performant manner even though the writes occur one-at-a-time.

Another, more drastic, approach that may work for some applications is to maintain a dedicated write-thread, and send all writes to a single connection (described in more detail here). As with the other suggestions, under high write pressure latency could become an issue. It is possible that the queue could become excessively large, leading to lags when an application thread needs to verify a write before continuing, so this approach only works in some scenarios.

Example code:

# Open database in autocommit mode by setting isolation_level to None.
conn = sqlite3.connect('app.db', isolation_level=None)

# Set journal mode to WAL.
conn.execute('pragma journal_mode=wal')

Here is an example illustrating the difference between the default journal mode, delete, and wal-mode:

### Default behavior.

import sqlite3

writer = sqlite3.connect('/tmp/scratch.db', isolation_level=None)
reader = sqlite3.connect('/tmp/scratch.db', isolation_level=None)

writer.execute('create table foo (data)')
reader.execute('select * from foo;')  # No problem.

writer.execute('begin exclusive;')
reader.execute('select * from foo;')  # OperationalError: database is locked

### WAL-mode.

writer = sqlite3.connect('/tmp/wal.db', isolation_level=None)
writer.execute('pragma journal_mode=wal;')

reader = sqlite3.connect('/tmp/wal.db', isolation_level=None)
reader.execute('pragma journal_mode=wal;')

writer.execute('create table foo (data)')
reader.execute('select * from foo')  # No problem.

writer.execute('begin exclusive')  # Acquire write lock.
reader.execute('select * from foo')  # Still no problem!

You can implement a simple context manager for handling transactions:

from contextlib import contextmanager

@contextmanager
def transaction(conn):
    # We must issue a "BEGIN" explicitly when running in auto-commit mode.
    conn.execute('BEGIN')
    try:
        # Yield control back to the caller.
        yield
    except:
        conn.rollback()  # Roll back all changes if an exception occurs.
        raise
    else:
        conn.commit()

My experience has been that by ensuring that writes are committed quickly, SQLite can be a great fit for web-based and multi-threaded applications. The WAL journal-mode and autocommit-mode also help ensure my assumptions about locks and transactional state are correct.

User-defined functions

SQLite runs embedded in memory alongside your application, allowing you to easily extend SQLite with your own Python code. SQLite provides quite a few hooks, a reasonable subset of which are implemented by the standard library database driver. The library documentation is thorough and provides example code:

Connection.create_function()
Connection.create_aggregate()
Connection.create_collation()
Access control: Connection.set_authorizer()
Serialize and deserialize native Python data-types: register_converter() and register_adapter()

The most powerful hook of all is not available through the standard libary driver, however: the ability to programmatically define complete tables (which are then queried using ordinary SQL). Using the virtual table APIs, it is possible to create completely dynamic tables. Or, put another way, to create user-defined functions that return tabular data.

The apsw SQLite driver provides hooks for implementing virtual tables in Python, but the APIs are close correlates to their C equivalents and can be tricky to work with for simple use-cases. A simpler approach is available through the SQLite extension library included in the 3.x release of peewee.

Let's look at a simple example using vtfunc, a standalone implementation of the virtual-table helper included with Peewee. In this example we'll create a user-defined function that returns tabular results (which can be joined, filtered, ordered, etc just like any normal query).

Generating series of integers

As a simple example, let's re-implement the SQLite generate_series extension, which behaves like Python's range builtin. We begin by defining a class that implements two methods (initialize and iterate) and defines two required attributes (params, columns):

import sqlite3
from vtfunc import TableFunction


class Series(TableFunction):
    params = ['start', 'stop', 'step']  # These are the arguments to our function.
    columns = ['output']  # These are the columns our output rows will have.
    name = 'series'  # Optional -- if unspecified, uses lower-case class name.

    def initialize(self, start=0, stop=None, step=1):
        pass

    def iterate(self, idx):
        pass

When our function is first called, the arguments specified by the caller are passed to the initialize function. Arguments may have default values. Then, when the database needs a row from our table (to populate a result set), the iterate() method will be called, returning a row of data or raising StopIteration if no more data is available:

class Series(TableFunction):
    params = ['start', 'stop', 'step']  # These are the arguments to our function.
    columns = ['idx', 'output']  # These are the columns our output rows will have.
    name = 'series'  # Optional -- if unspecified, uses lower-case class name.

    def initialize(self, start=0, stop=None, step=1):
        self.start = self.curr = start
        self.stop = stop if stop is not None else float('inf')
        self.step = step

    def iterate(self, idx):
        if (self.step > 0 and self.curr >= self.stop) or \
           (self.step < 0 and self.curr <= self.stop):
            raise StopIteration
        ret = self.curr
        self.curr += self.step
        return (idx, ret)

To start using our new series function, we'll register it with a connection and then query it:

conn = sqlite3.connect(':memory:')
Series.register(conn)

for row in conn.execute('SELECT * FROM series(0, 10, 2)'):
    print row

# (0, 0)
# (1, 2)
# (2, 4)
# (3, 6)
# (4, 8)

print conn.execute('SELECT * FROM series(0, NULL, 20) LIMIT 4').fetchall()
# [(0, 0), (1, 20), (2, 40), (3, 60)]

Using virtual tables, it's even possible to expose a SQL interface over a completely separate data source like Redis or CSV data.

More examples and information is available here:

Useful PRAGMAs

SQLite requires zero configuration to get up and running, but in order to go fast there are a few settings you may want to adjust. SQLite is configured by executing special queries of the form PRAGMA <setting> = <value>. Here are a few that I routinely find myself modifying:

journal_mode = wal - enabling write-ahead-logging means that multiple readers can coexist with a single writer. Ordinarily, when a write lock is acquired, no other connection can write or read! So this can help a lot in read-heavy, multi-threaded applications. WAL-mode is also significantly faster in most scenarios.
cache_size = -size in KiB - the default cache size is ~2MB. Typically you will want your cache to be large enough to hold your working data in memory, so size up accordingly. Warning: positive values are treated as number of pages, negative values are treated as KiB.
mmap_size = size in bytes - may be more performant for I/O intensive applications, and may also use less RAM since pages can be shared with the OS cache. Check out the docs for more details.
synchronous = 0 - use with caution! Disabling syncs can cause data corruption in the event of operating system crash or sudden power loss. Application crashes will not lead to data loss in this mode, however.

Non-persistent PRAGMA queries should be executed whenever a new connection is opened. Of the above, only journal_mode will be persisted after the database is created (and then, only if you are setting it to WAL-mode).

Compilation Flags

Many distributions ship with an old-ish version of SQLite that does not include some of the cool extension modules. Here are flags I typically use to compile SQLite:

SQLITE_ALLOW_COVERING_INDEX_SCAN=1 -- enable cover index optimization
SQLITE_DEFAULT_CACHE_SIZE=-8000 -- more sane default
SQLITE_DEFAULT_SYNCHRONOUS=0 -- faster, corruption only possible due to power failure or os crash.
SQLITE_DEFAULT_WAL_SYNCHRONOUS=0
SQLITE_DISABLE_DIRSYNC -- small optimization to reduce syncs when files deleted
SQLITE_ENABLE_FTS3 -- enable all the full-text search extensions!
SQLITE_ENABLE_FTS3_PARENTHESIS
SQLITE_ENABLE_FTS4
SQLITE_ENABLE_FTS5
SQLITE_ENABLE_JSON1 -- enable native JSON support
SQLITE_ENABLE_STAT4 -- enable the statistics extension
SQLITE_ENABLE_UPDATE_DELETE_LIMIT -- allow LIMIT clause on UPDATE and DELETE queries.
SQLITE_STMTJRNL_SPILL=-1 -- do not spill the statement journal to disk
SQLITE_TEMP_STORE=3 -- never use disk for temporary storage
SQLITE_USE_URI -- allow URI connection strings

For debugging / profiling, the following additional flags may be useful:

SQLITE_ENABLE_COLUMN_METADATA -- make additional metadata available.
SQLITE_ENABLE_DBSTAT_VTAB -- more statistics! Check out the docs.
SQLITE_ENABLE_EXPLAIN_COMMENTS -- adds additional info to EXPLAIN output.
SQLITE_ENABLE_IOTRACE -- adds .iotrace command to shell for low-level I/O logging.
SQLITE_ENABLE_STMT_SCANSTATUS -- see docs.

To compile the latest version of SQLite, you can run something like the following:

fossil clone http://www.sqlite.org/cgi/src sqlite.fossil
mkdir sqlite-src
cd sqlite-src/
fossil open ../sqlite.fossil

export CFLAGS="-DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_JSON1..."  # etc...
export CFLAGS="$CFLAGS -fPIC -O2"
export PREFIX="$(pwd)"
LIBS="-lm" ./configure --enable-static --enable-shared --prefix="$PREFIX"
make && make install

You can create self-contained pysqlite3 libraries by statically-linking against your custom build of SQLite:

git clone https://github.com/coleifer/pysqlite3
cd pysqlite3/
cp $PREFIX/sqlite3.[ch] .  # Copy sqlite3.c and sqlite3.h into checkout.
python setup.py build_static build

You can use the statically-linked pysqlite3 just as you would normally use the standard library's sqlite3 module:

from pysqlite3 import dbapi2 as sqlite3

conn = sqlite3.connect(':memory:')
print conn.execute('PRAGMA compile_options').fetchall()
# [(u'ALLOW_COVERING_INDEX_SCAN',),
#  (u'DEFAULT_CACHE_SIZE=-8000',),
#  (u'ENABLE_FTS3',),
#  (u'ENABLE_FTS3_PARENTHESIS',),
#  etc...

You can find a bit more information on the README for pysqlite3.

Comments (0)

Commenting has been closed.