Using SQLite Full-Text Search with Python
In this post I will show how to use SQLite full-text search with Python (and a lot of help from peewee ORM). We will see how to index content for searching, how to perform searches, and how to order search results using two ranking algorithms.
Last week I migrated my site from Postgresql to SQLite. I had been using Redis to power my site's search, but since SQLite has an awesome full-text search extension, I decided to give it a try. I am really pleased with the results, and being able to specify boolean search queries is an added plus. Here is a brief overview of the types of search queries SQLite supports:
- Simple phrase: peewee would return all docs containing the word peewee.
- Prefix queries: py* would return docs containing Python, pypi, etc.
- Quoted phrases: "sqlite extension"
NEAR: peewee NEAR sqlite would return docs containing the words peewee and sqlite with no more than
10intervening words. You can also specify the max number of intervening words, e.g. peewee NEAR/3 sqlite. With SQLite FTS5 the syntax is slightly different: NEAR(peewee sqlite, 3).
NOT: sqlite OR postgresql AND NOT mysql would return docs about high-quality databases (just trollin).
If you'd like to give it a try, here are a couple example searches on my blog:
In order to use this cool SQLite feature, we need to use the SqliteExtDatabase class for our application's database:
from playhouse.sqlite_ext import SqliteExtDatabase db = SqliteExtDatabase('blog.db')
Since we plan to use this database for a multi-user application, let's also specify a couple SQLite
PRAGMA options that will give our application better performance:
pragmas = [ ('journal_mode', 'wal'), ('cache_size', -1000 * 32)] db = SqliteExtDatabase('blog.db', pragmas=pragmas)
This database subclass comes with special support for SQLite extensions and virtual tables, which we will need to index content for searching. If you're curious, the pragmas we set instruct SQLite to use the write-ahead logging journal mode, which enables multiple readers to co-exist with a single writer. The cache_size pragma tells SQLite how much memory to use for the page-cache (negative values are in kibibytes, positive values are in pages).
Note for Windows users: it's come to my attention that some Python builds for Windows include a SQLite that does not have the full-text search extensions enabled. To compile a SQLite DLL, check out the SQLite downloads page and compile with the
A note on virtual tables
In order to use full-text searches with SQLite, you need to create a special virtual table in the database for the content you wish to index. Rows are created in this table just like any other, but behind-the-scenes the FTS extension will create some indexes that will enable fast searches. Since we'll be using peewee ORM, we need to create a Model to represent the search content. The peewee SQLite helpers will allow us to create the model with the appropriate table definition. To create a virtual table for the FTS extension, simply subclass
playhouse.sqlite_ext.FTSModel instead of the usual
Creating the Models
With our database configured, we'll create a really basic blog entry model that has a
title and a
content field. In reality you may have a number of other fields like
timestamp, etc, but since the other columns have no effect on the rest of the implementation, I will skip them. Here is the model definition:
from peewee import * from playhouse.sqlite_ext import * pragmas = [ ('journal_mode', 'wal'), ('cache_size', -1024 * 32)] db = SqliteExtDatabase('blog.db', pragmas=pragmas) class Entry(Model): title = TextField() content = TextField() # You may have any number of other fields, such as status, # timestamp, etc. class Meta: database = db
We now need to create a special model for storing and indexing the entry content. Since each row in the index table will correspond exactly to one blog entry, we can specify the
entry as the primary key. If you have multiple columns you wish to search, such as a
title and a
content field, you could add them to the
FTSEntry model or just concatenate the data together and dump it into a single text field. For simplicity we'll do the latter and store everything in a single
class FTSEntry(FTSModel): content = TextField() class Meta: database = db
You may wonder why not just have all your searchable models be subclasses of
FTSModel. I would advise against doing this because of some limitations to SQLite's virtual tables. Virtual tables do not support indexes, which can really hurt performance. Additionally the full-text virtual tables treat all columns as having the
TEXT data-type, regardless of the actual column definition. So instead of using
FTSModel for everything, typically you'll keep your data in a normal table, and then store the search content in a related virtual table.
Open up a python shell and create the tables:
# Create the tables. >>> from search_example import Entry, FTSEntry >>> Entry.create_table() >>> FTSEntry.create_table()
After creating the tables from the python shell, we can introspect the schema and see that SQLite has created the
FTSEntry table as well as several other tables to store search metadata:
sqlite> .schema CREATE TABLE "entry" ( "id" INTEGER NOT NULL PRIMARY KEY, "title" TEXT NOT NULL, "content" TEXT NOT NULL); CREATE VIRTUAL TABLE "ftsentry" USING FTS4 ( "content" TEXT NOT NULL); CREATE TABLE 'ftsentry_content'(...) CREATE TABLE 'ftsentry_segments'(...) CREATE TABLE 'ftsentry_segdir'(...) CREATE TABLE 'ftsentry_docsize'(...) CREATE TABLE 'ftsentry_stat'(...)
For this example I'm going to use the actual blog entries from site, but if you're following along and want to create some entries, you might write some code like this:
entry = Entry.create( title='How I rewrote everything with golang', content='Blah blah blah, type system, channels, blurgh') FTSEntry.create( docid=entry.id, content='\n'.join((entry.title, entry.content))) entry = Entry.create( title='Why ORMs are a terrible idea', content='Blah blah blah, leaky abstraction, impedance mismatch') FTSEntry.create( docid=entry.id, # Manually set the primary key ("docid") to entry's id. content='\n'.join((entry.title, entry.content)))
As you can see, indexing content is as simple as creating a new row in the
FTSEntry table. Coming up with good content is another story...
SQLite uses the
MATCH operator to express a full-text search query, e.g.
sqlite> SELECT entry.title ...> FROM entry ...> JOIN ftsentry ON entry.id = ftsentry.docid ...> WHERE ftsentry MATCH 'django AND peewee AND flask'; Structuring flask apps, a how-to for those coming from Django Peewee was baroque, so I rewrote it The missing library: ad-hoc queries for your models Using Redis Pub/Sub and IRC for Error Logging with Python Don't sweat the small stuff - use flask blueprints Integrating the flask microframework with the peewee ORM Peewee, a lightweight Python ORM - Original Post
You can still use equality or
LIKE comparisons with the
FTSEntry table, but these will be slow because they involve a full table scan. The
MATCH operator, on the other hand, makes use of the special full-text search indexes and is very performant. From peewee, we can perform simple search queries by using the
Suppose that we wanted to perform a search and retrieve the populated
Sorting by best match
By default, the results returned from a
MATCH query are in an unspecified order. For full-text search to be useful, though, the results should be ordered by relevance. SQLite's FTS extension does not come with a relevance function per-se, but it is possible to extract metadata from the full-text index and get pretty good results. The SQLite documentation on FTS contains a brief appendix which describes one method I will demonstrate below. The idea is that we will use the SQLite matchinfo function, which returns information about term counts, to rank the result rows.
blob returned by
matchinfo() contains several fields:
p: the number of matchable phrases in the query.
c: the number of user-defined columns on the FTS table, excluding
docidor any hidden meta-columns.
x: for each combination of column and phrase (
c * p), return three values:
- The number of times the phrase appears in the column for the current row.
- The total number of times the phrase appears in the column across all rows.
- The total number of rows in the FTS table which contain the phrase in the column.
If you are using FTS4, then there are several additional fields available:
n: the number of rows in the table
a: for each column, the average number of tokens that are stored.
l: for each column, the length of the value stored in the current row (in tokens).
s: for each column, the length of the longest subsequence of phrase matches that the column value has in common with the query text.
Our first ranking function will calculate ∑(x1 / x2) for each phrase/column pair in a given row. We will implement this function in Python, then register it with the SQLite connection as a user-defined function.
sqlite_ext module defines and automatically registers this function for you. This code is presented by way of explanation in case you're curious how things work.
matchinfo() function returns a blob of 32-bit unsigned ints for each value, so we will need to convert this data to a Python
def _parse_match_info(buf): bufsize = len(buf) # Length in bytes. return [struct.unpack('@I', buf[i:i+4]) for i in range(0, bufsize, 4)]
The length of the
matchinfo data varies depending on the number of phrases in the search query, and the number of columns on the table being queried. Our
rank function will determine the number of phrases and columns, then calculate a sum of counts, which we will treat as the rank of the document:
def rank(raw_match_info): # handle match_info called w/default args 'pcx' - based on the example rank # function http://sqlite.org/fts3.html#appendix_a match_info = _parse_match_info(raw_match_info) score = 0.0 p, c = match_info[:2] for phrase_num in range(p): phrase_info_idx = 2 + (phrase_num * c * 3) for col_num in range(c): col_idx = phrase_info_idx + (col_num * 3) x1, x2 = match_info[col_idx:col_idx + 2] if x1 > 0: score += float(x1) / x2 return -score
Now that we have our ranking function ready, we can start to use it in our SQLite queries. The peewee SQLite helper module automatically handles registering this function for you, so it will just work.
Here is an example of the SQL we will now generate to order search results by relevance:
SELECT entry.title, rank(matchinfo(ftsentry)) AS score FROM entry JOIN ftsentry ON entry.id = ftsentry.docid WHERE ftsentry MATCH 'leet desktop' ORDER BY score; -- returns -- Ricing the Desktop: "Brown rice", -0.273 In Honor of Spring..., -0.152 leet, -0.148 Using SQLite Full-Text Search with Python, -0.090
The first result returned contains the best match, and has two occurrences of the term
leet, and four for
desktop. The second result contains the term
desktop three times. You can see that the ranking algorithm gave each row a score corresponding to the number of times the search term occurs.
Peewee provides convenience methods to handle this common use-case, so we can simply write:
query = (Entry .select(Entry, FTSEntry.rank().alias('score')) .join(FTSEntry, on=(Entry.id == FTSEntry.docid)) .where(FTSEntry.match('leet desktop')) .order_by(FTSEntry.rank()))
If all of the information you need is contained in the FTS table, then you can use the
FTSModel.search() helper. In the following snippet we'll search the
FTSEntry table and return the first line of the indexed search content:
>>> for result in FTSEntry.search('sqlite AND berkeleydb'): ... print(result.content.splitlines(), round(result.score, 2)) Building the python SQLite driver for use with BerkeleyDB -0.68 Migrating to SQLite -0.56 Saturday morning hack: personalized news digest with boolean query parser -0.14
And that is all there is to it! As I mentioned earlier, the
playhouse.sqlite_ext module contains an implementation of the
rank() function and will automatically register it on the SQLite connection. The
sqlite_ext module also contains the
FTSModel class which can be used for full-text search virtual tables.
Ranking results using the Okapi BM25 algorithm
If you're looking for a more sophisticated ranking algorithm,
playhouse.sqlite_ext also contains an implementation of the Okapi BM25 algorithm, which is derived from this C code. This ranking method requires the
l fields of the
matchinfo structure, so this only works with FTS4 tables.
To use the BM25 algorithm, you can simply use the
If you are performing a JOIN or otherwise would like finer-grained control, you can write something like this instead:
The BM25 method supports several parameters for tuning, so if you're interested in more details, check out the documentation. For more control and direct access to the BM25 function, check out the BM25 helper docs.
Thanks for reading
- SQLite's full-text search extension
- Peewee full-text search documentation
- Implementing a rank() function
- Okapi BM25
Here are some blog posts on similar topics which you may find interesting:
- Scout, a search server powered by Python and SQLite
- Migrating to SQLite
- SQLite: small, fast, reliable, choose any three
- Building a little note-taking app with Flask and Peewee
Commenting has been closed.