Going Fast with SQLite and Python
In this post I'd like to share with you some techniques for effectively working with SQLite using Python. SQLite is a capable library, providing an in-process relational database for efficient storage of small-to-medium-sized data sets. It supports most of the common features of SQL with few exceptions. Best of all, most Python users do not need to install anything to get started working with SQLite, as the standard library in most distributions ships with the sqlite3 module.
The documentation for the standard library SQLite module (
pysqlite) is quite
thorough, so I will not try to rehash what is already covered in detail (and
kept up-to-date, unlike this blog...). Instead I'll discuss some ways to
achieve better performance from SQLite. I'll also share some tricks you may not
have been aware of. I hope you find this information helpful.
Transactions, Concurrency, and Autocommit
pysqlite is notorious for implementing weird transactional semantics. By
pysqlite will open a transaction when you issue your first write
query (it does this by examining every query you execute). The transaction will
be automatically committed when you either call
execute any other query that is not a SELECT, INSERT, UPDATE, or DELETE
(for example CREATE TABLE or PRAGMA).
This makes it very easy to issue a write, which acquires the global SQLite
write lock, and then unnecessarily hold that lock while you issue SELECT
queries, etc, which have no need of the write lock. Because SQLite only allows
a single writer per database, it's typically in your best interests to keep
write transactions as short as possible. The semantics of
pysqlite can give a
false impression that SQLite is completely unsuitable for any application that
requires concurrent database access, despite the fact that writes typically
occur very quickly.
There are a couple ways to address the problematic interaction of the global
write lock and the bad behavior of
pysqlite. The most general would be to use
the write-ahead-logging (WAL)
journal_mode option. WAL-mode allows multiple
readers to co-exist with a single writer. Ordinarily, when one connection is
holding the write lock, no other connection can write or read until the lock
is released. WAL-mode relaxes these restrictions by allowing readers to
operate while another connection writes to the database.
pysqlite in autocommit mode and explicitly managing transactional
state in your application can also provide better performance. In this way you
can ensure that you are not holding a write-lock any longer than absolutely
necessary. Unless you explicitly issue a BEGIN statement, opening a
transaction, all statements will be executed independently, in their own
transactions. This means that the write lock is held for the minimum amount of
time necessary. As I mentioned earlier, writes occur very quickly, so it is
possible for many connections to write to the database in a performant manner
even though the writes occur one-at-a-time.
Another, more drastic, approach that may work for some applications is to maintain a dedicated write-thread, and send all writes to a single connection (described in more detail here). As with the other suggestions, under high write pressure latency could become an issue. It is possible that the queue could become excessively large, leading to lags when an application thread needs to verify a write before continuing, so this approach only works in some scenarios.
# Open database in autocommit mode by setting isolation_level to None. conn = sqlite3.connect('app.db', isolation_level=None) # Set journal mode to WAL. conn.execute('pragma journal_mode=wal')
Here is an example illustrating the difference between the default journal mode, delete, and wal-mode:
### Default behavior. import sqlite3 writer = sqlite3.connect('/tmp/scratch.db', isolation_level=None) reader = sqlite3.connect('/tmp/scratch.db', isolation_level=None) writer.execute('create table foo (data)') reader.execute('select * from foo;') # No problem. writer.execute('begin exclusive;') reader.execute('select * from foo;') # OperationalError: database is locked ### WAL-mode. writer = sqlite3.connect('/tmp/wal.db', isolation_level=None) writer.execute('pragma journal_mode=wal;') reader = sqlite3.connect('/tmp/wal.db', isolation_level=None) reader.execute('pragma journal_mode=wal;') writer.execute('create table foo (data)') reader.execute('select * from foo') # No problem. writer.execute('begin exclusive') # Acquire write lock. reader.execute('select * from foo') # Still no problem!
You can implement a simple context manager for handling transactions:
from contextlib import contextmanager @contextmanager def transaction(conn): # We must issue a "BEGIN" explicitly when running in auto-commit mode. conn.execute('BEGIN') try: # Yield control back to the caller. yield except: conn.rollback() # Roll back all changes if an exception occurs. raise else: conn.commit()
My experience has been that by ensuring that writes are committed quickly, SQLite can be a great fit for web-based and multi-threaded applications. The WAL journal-mode and autocommit-mode also help ensure my assumptions about locks and transactional state are correct.
SQLite runs embedded in memory alongside your application, allowing you to easily extend SQLite with your own Python code. SQLite provides quite a few hooks, a reasonable subset of which are implemented by the standard library database driver. The library documentation is thorough and provides example code:
- Access control: Connection.set_authorizer()
- Serialize and deserialize native Python data-types: register_converter() and register_adapter()
The most powerful hook of all is not available through the standard libary driver, however: the ability to programmatically define complete tables (which are then queried using ordinary SQL). Using the virtual table APIs, it is possible to create completely dynamic tables. Or, put another way, to create user-defined functions that return tabular data.
The apsw SQLite driver provides hooks for implementing virtual tables in Python, but the APIs are close correlates to their C equivalents and can be tricky to work with for simple use-cases. A simpler approach is available through the vtfunc library (and will be included in the 3.x release of peewee).
Let's look at a simple example using vtfunc, in which we'll create a user-defined function that returns tabular results (which can be joined, filtered, ordered, etc just like any normal query).
Generating series of integers
As a simple example, let's re-implement the SQLite generate_series
extension, which behaves like Python's
range builtin. We begin by defining a
class that implements two methods (
iterate) and defines two
required attributes (
import sqlite3 from vtfunc import TableFunction class Series(TableFunction): params = ['start', 'stop', 'step'] # These are the arguments to our function. columns = ['output'] # These are the columns our output rows will have. name = 'series' # Optional -- if unspecified, uses lower-case class name. def initialize(self, start=0, stop=None, step=1): pass def iterate(self, idx): pass
When our function is first called, the arguments specified by the caller are
passed to the
initialize function. Arguments may have default values. Then,
when the database needs a row from our table (to populate a result set), the
iterate() method will be called, returning a row of data or raising
StopIteration if no more data is available:
class Series(TableFunction): params = ['start', 'stop', 'step'] # These are the arguments to our function. columns = ['idx', 'output'] # These are the columns our output rows will have. name = 'series' # Optional -- if unspecified, uses lower-case class name. def initialize(self, start=0, stop=None, step=1): self.start = self.curr = start self.stop = stop if stop is not None else float('inf') self.step = step def iterate(self, idx): if (self.step > 0 and self.curr >= self.stop) or \ (self.step < 0 and self.curr <= self.stop): raise StopIteration ret = self.curr self.curr += self.step return (idx, ret)
To start using our new
series function, we'll register it with a connection
and then query it:
conn = sqlite3.connect(':memory:') Series.register(conn) for row in conn.execute('SELECT * FROM series(0, 10, 2)'): print row # (0, 0) # (1, 2) # (2, 4) # (3, 6) # (4, 8) print conn.execute('SELECT * FROM series(0, NULL, 20) LIMIT 4').fetchall() # [(0, 0), (1, 20), (2, 40), (3, 60)]
Using virtual tables, it's even possible to expose a SQL interface over a completely separate data source like Redis or CSV data.
More examples are available here:
SQLite requires zero configuration to get up and running, but in order to go
fast there are a few settings you may want to adjust. SQLite is configured by
executing special queries of the form
PRAGMA <setting> = <value>. Here are a
few that I routinely find myself modifying:
- journal_mode = wal - enabling write-ahead-logging means that multiple readers can coexist with a single writer. Ordinarily, when a write lock is acquired, no other connection can write or read! So this can help a lot in read-heavy, multi-threaded applications. WAL-mode is also significantly faster in most scenarios.
- cache_size = -size in KiB - the default cache size is ~2MB. Typically you will want your cache to be large enough to hold your working data in memory, so size up accordingly. Warning: positive values are treated as number of pages, negative values are treated as KiB.
- mmap_size = size in bytes - may be more performant for I/O intensive applications, and may also use less RAM since pages can be shared with the OS cache. Check out the docs for more details.
- synchronous = 0 - use with caution! Disabling syncs can cause data corruption in the event of operating system crash or sudden power loss. Application crashes will not lead to data loss in this mode, however.
Non-persistent PRAGMA queries should be executed whenever a new connection is opened. Of the above, only journal_mode will be persisted after the database is created (and then, only if you are setting it to WAL-mode).
Many distributions ship with an old-ish version of SQLite that does not include some of the cool extension modules. Here are flags I typically use to compile SQLite:
SQLITE_ALLOW_COVERING_INDEX_SCAN=1-- enable cover index optimization
SQLITE_DEFAULT_CACHE_SIZE=-8000-- more sane default
SQLITE_DEFAULT_SYNCHRONOUS=0-- faster, corruption only possible due to power failure or os crash.
SQLITE_DISABLE_DIRSYNC-- small optimization to reduce syncs when files deleted
SQLITE_ENABLE_FTS3-- enable all the full-text search extensions!
SQLITE_ENABLE_JSON1-- enable native JSON support
SQLITE_ENABLE_STAT4-- enable the statistics extension
SQLITE_ENABLE_UPDATE_DELETE_LIMIT-- allow LIMIT clause on UPDATE and DELETE queries.
SQLITE_STMTJRNL_SPILL=-1-- do not spill the statement journal to disk
SQLITE_TEMP_STORE=3-- never use disk for temporary storage
SQLITE_USE_URI-- allow URI connection strings
For debugging / profiling, the following additional flags may be useful:
SQLITE_ENABLE_COLUMN_METADATA-- make additional metadata available.
SQLITE_ENABLE_DBSTAT_VTAB-- more statistics! Check out the docs.
SQLITE_ENABLE_EXPLAIN_COMMENTS-- adds additional info to
.iotracecommand to shell for low-level I/O logging.
SQLITE_ENABLE_STMT_SCANSTATUS-- see docs.
To compile the latest version of SQLite, you can run something like the following:
fossil clone http://www.sqlite.org/cgi/src sqlite.fossil mkdir sqlite-src cd sqlite-src/ fossil open ../sqlite.fossil export CFLAGS="-DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_JSON1..." # etc... export CFLAGS="$CFLAGS -fPIC -O2" export PREFIX="$(pwd)" LIBS="-lm" ./configure --enable-static --enable-shared --prefix="$PREFIX" make && make install
You can create self-contained
pysqlite libraries by statically-linking
against your custom build of SQLite:
git clone https://github.com/ghaering/pysqlite cd pysqlite/ cp $PREFIX/sqlite3.c . # Copy sqlite3.c into checkout. echo -e "library_dirs=$PREFIX/lib" >> setup.cfg echo -e "include_dirs=$PREFIX/include" >> setup.cfg LIBS="-lm" python setup.py build_static
You can use the statically-linked
pysqlite just as you would normally use the standard library's
from pysqlite2 import dbapi2 as sqlite3 conn = sqlite3.connect(':memory:') print conn.execute('PRAGMA compile_options').fetchall() # [(u'ALLOW_COVERING_INDEX_SCAN',), # (u'DEFAULT_CACHE_SIZE=-8000',), # (u'ENABLE_FTS3',), # (u'ENABLE_FTS3_PARENTHESIS',), # etc...
For more information along these lines, here are some links you may find useful:
- In-depth presentation on SQLite
- Excellent presentation with lots of performance tips
- SQLite gotchas and best practices (from mozilla's wiki)
- All the other posts I've written on SQLite
Commenting has been closed, but please feel free to contact me