Using SQLite Full-Text Search with Python

May 12, 2014 19:12 / peewee python search sqlite / 0 comments

In this post I will show how to use SQLite full-text search with Python (and a lot of help from peewee ORM). We will see how to index content for searching, how to perform searches, and how to order search results using two ranking algorithms.

Last week I migrated my site from Postgresql to SQLite. I had been using Redis to power my site's search, but since SQLite has an awesome full-text search extension, I decided to give it a try. I am really pleased with the results, and being able to specify boolean search queries is an added plus. Here is a brief overview of the types of search queries SQLite supports:

Simple phrase: peewee would return all docs containing the word peewee.
Prefix queries: py* would return docs containing Python, pypi, etc.
Quoted phrases: "sqlite extension"
NEAR: peewee NEAR sqlite would return docs containing the words peewee and sqlite with no more than 10 intervening words. You can also specify the max number of intervening words, e.g. peewee NEAR/3 sqlite. With SQLite FTS5 the syntax is slightly different: NEAR(peewee sqlite, 3).
AND, OR, NOT: sqlite OR postgresql AND NOT mysql would return docs about high-quality databases (just trollin).

If you'd like to give it a try, here are a couple example searches on my blog:

Getting started

In order to use this cool SQLite feature, we need to use the SqliteExtDatabase class for our application's database:

from playhouse.sqlite_ext import SqliteExtDatabase

db = SqliteExtDatabase('blog.db')

Since we plan to use this database for a multi-user application, let's also specify a couple SQLite PRAGMA options that will give our application better performance:

pragmas = [
    ('journal_mode', 'wal'),
    ('cache_size', -1000 * 32)]
db = SqliteExtDatabase('blog.db', pragmas=pragmas)

This database subclass comes with special support for SQLite extensions and virtual tables, which we will need to index content for searching. If you're curious, the pragmas we set instruct SQLite to use the write-ahead logging journal mode, which enables multiple readers to co-exist with a single writer. The cache_size pragma tells SQLite how much memory to use for the page-cache (negative values are in kibibytes, positive values are in pages).

Note for Windows users: it's come to my attention that some Python builds for Windows include a SQLite that does not have the full-text search extensions enabled. To compile a SQLite DLL, check out the SQLite downloads page and compile with the ENABLE_FTS3, ENABLE_FTS3_PARENTHESIS and ENABLE_FTS4 flags.

A note on virtual tables

In order to use full-text searches with SQLite, you need to create a special virtual table in the database for the content you wish to index. Rows are created in this table just like any other, but behind-the-scenes the FTS extension will create some indexes that will enable fast searches. Since we'll be using peewee ORM, we need to create a Model to represent the search content. The peewee SQLite helpers will allow us to create the model with the appropriate table definition. To create a virtual table for the FTS extension, simply subclass playhouse.sqlite_ext.FTSModel instead of the usual Model class.

Creating the Models

With our database configured, we'll create a really basic blog entry model that has a title and a content field. In reality you may have a number of other fields like status, timestamp, etc, but since the other columns have no effect on the rest of the implementation, I will skip them. Here is the model definition:

from peewee import *
from playhouse.sqlite_ext import *

pragmas = [
    ('journal_mode', 'wal'),
    ('cache_size', -1024 * 32)]

db = SqliteExtDatabase('blog.db', pragmas=pragmas)


class Entry(Model):
    title = TextField()
    content = TextField()
    # You may have any number of other fields, such as status,
    # timestamp, etc.

    class Meta:
        database = db

We now need to create a special model for storing and indexing the entry content. Since each row in the index table will correspond exactly to one blog entry, we can specify the entry as the primary key. If you have multiple columns you wish to search, such as a title and a content field, you could add them to the FTSEntry model or just concatenate the data together and dump it into a single text field. For simplicity we'll do the latter and store everything in a single content field:

class FTSEntry(FTSModel):
    content = TextField()

    class Meta:
        database = db

You may wonder why not just have all your searchable models be subclasses of FTSModel. I would advise against doing this because of some limitations to SQLite's virtual tables. Virtual tables do not support indexes, which can really hurt performance. Additionally the full-text virtual tables treat all columns as having the TEXT data-type, regardless of the actual column definition. So instead of using FTSModel for everything, typically you'll keep your data in a normal table, and then store the search content in a related virtual table.

Open up a python shell and create the tables:

# Create the tables.
>>> from search_example import Entry, FTSEntry
>>> Entry.create_table()
>>> FTSEntry.create_table()

After creating the tables from the python shell, we can introspect the schema and see that SQLite has created the FTSEntry table as well as several other tables to store search metadata:

sqlite> .schema
CREATE TABLE "entry" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "title" TEXT NOT NULL,
    "content" TEXT NOT NULL);
CREATE VIRTUAL TABLE "ftsentry" USING FTS4 (
    "content" TEXT NOT NULL);
CREATE TABLE 'ftsentry_content'(...)
CREATE TABLE 'ftsentry_segments'(...)
CREATE TABLE 'ftsentry_segdir'(...)
CREATE TABLE 'ftsentry_docsize'(...)
CREATE TABLE 'ftsentry_stat'(...)

For this example I'm going to use the actual blog entries from site, but if you're following along and want to create some entries, you might write some code like this:

  entry = Entry.create(
      title='How I rewrote everything with golang',
      content='Blah blah blah, type system, channels, blurgh')
  FTSEntry.create(
      docid=entry.id,
      content='\n'.join((entry.title, entry.content)))

  entry = Entry.create(
      title='Why ORMs are a terrible idea',
      content='Blah blah blah, leaky abstraction, impedance mismatch')

  FTSEntry.create(
      docid=entry.id,  # Manually set the primary key ("docid") to entry's id.
      content='\n'.join((entry.title, entry.content)))

As you can see, indexing content is as simple as creating a new row in the FTSEntry table. Coming up with good content is another story...

Performing searches

SQLite uses the MATCH operator to express a full-text search query, e.g.

sqlite> SELECT entry.title
   ...> FROM entry
   ...> JOIN ftsentry ON entry.id = ftsentry.docid
   ...> WHERE ftsentry MATCH 'django AND peewee AND flask';
Structuring flask apps, a how-to for those coming from Django
Peewee was baroque, so I rewrote it
The missing library: ad-hoc queries for your models
Using Redis Pub/Sub and IRC for Error Logging with Python
Don't sweat the small stuff - use flask blueprints
Integrating the flask microframework with the peewee ORM
Peewee, a lightweight Python ORM - Original Post

You can still use equality or LIKE comparisons with the FTSEntry table, but these will be slow because they involve a full table scan. The MATCH operator, on the other hand, makes use of the special full-text search indexes and is very performant. From peewee, we can perform simple search queries by using the FTSModel.match helper:

query = (Entry
         .select(Entry.title)
         .join(FTSEntry, on=(Entry.id == FTSEntry.docid))
         .where(FTSEntry.match('javascript AND canvas'))
         .dicts())
for row_dict in query:
    print(row_dict)

# Prints...
{'title': u'Using python and k-means to find the dominant colors in images'}
{'title': u'Even more Canvas fun - Tetris in JavaScript'}
{'title': u'More fun with Canvas - a JavaScript Starfield!'}
{'title': u'Nokia Snake with JavaScript + Canvas'}

Suppose that we wanted to perform a search and retrieve the populated Entry object:

query = (Entry
         .select()
         .join(FTSEntry, on=(Entry.id == FTSEntry.docid))
         .where(FTSEntry.match('javascript AND canvas')))
for entry in query:
    print(entry.title)

# Prints:
Using python and k-means to find the dominant colors in images
Even more Canvas fun - Tetris in JavaScript
More fun with Canvas - a JavaScript Starfield!
Nokia Snake with JavaScript + Canvas

Sorting by best match

By default, the results returned from a MATCH query are in an unspecified order. For full-text search to be useful, though, the results should be ordered by relevance. SQLite's FTS extension does not come with a relevance function per-se, but it is possible to extract metadata from the full-text index and get pretty good results. The SQLite documentation on FTS contains a brief appendix which describes one method I will demonstrate below. The idea is that we will use the SQLite matchinfo function, which returns information about term counts, to rank the result rows.

The blob returned by matchinfo() contains several fields:

p: the number of matchable phrases in the query.
c: the number of user-defined columns on the FTS table, excluding docid or any hidden meta-columns.
x: for each combination of column and phrase (c * p), return three values:
1. The number of times the phrase appears in the column for the current row.
2. The total number of times the phrase appears in the column across all rows.
3. The total number of rows in the FTS table which contain the phrase in the column.

If you are using FTS4, then there are several additional fields available:

n: the number of rows in the table
a: for each column, the average number of tokens that are stored.
l: for each column, the length of the value stored in the current row (in tokens).
s: for each column, the length of the longest subsequence of phrase matches that the column value has in common with the query text.

Our first ranking function will calculate ∑(x₁ / x₂) for each phrase/column pair in a given row. We will implement this function in Python, then register it with the SQLite connection as a user-defined function.

Note

The peewee sqlite_ext module defines and automatically registers this function for you. This code is presented by way of explanation in case you're curious how things work.

The matchinfo() function returns a blob of 32-bit unsigned ints for each value, so we will need to convert this data to a Python list of int:

def _parse_match_info(buf):
    bufsize = len(buf)  # Length in bytes.
    return [struct.unpack('@I', buf[i:i+4])[0] for i in range(0, bufsize, 4)]

The length of the matchinfo data varies depending on the number of phrases in the search query, and the number of columns on the table being queried. Our rank function will determine the number of phrases and columns, then calculate a sum of counts, which we will treat as the rank of the document:

def rank(raw_match_info):
    # handle match_info called w/default args 'pcx' - based on the example rank
    # function http://sqlite.org/fts3.html#appendix_a
    match_info = _parse_match_info(raw_match_info)
    score = 0.0
    p, c = match_info[:2]
    for phrase_num in range(p):
        phrase_info_idx = 2 + (phrase_num * c * 3)
        for col_num in range(c):
            col_idx = phrase_info_idx + (col_num * 3)
            x1, x2 = match_info[col_idx:col_idx + 2]
            if x1 > 0:
                score += float(x1) / x2
    return -score

Now that we have our ranking function ready, we can start to use it in our SQLite queries. The peewee SQLite helper module automatically handles registering this function for you, so it will just work.

Here is an example of the SQL we will now generate to order search results by relevance:

SELECT entry.title, rank(matchinfo(ftsentry)) AS score
FROM entry
JOIN ftsentry ON entry.id = ftsentry.docid
WHERE ftsentry MATCH 'leet desktop'
ORDER BY score;

-- returns --
Ricing the Desktop: "Brown rice", -0.273
In Honor of Spring..., -0.152
leet, -0.148
Using SQLite Full-Text Search with Python, -0.090

The first result returned contains the best match, and has two occurrences of the term leet, and four for desktop. The second result contains the term desktop three times. You can see that the ranking algorithm gave each row a score corresponding to the number of times the search term occurs.

Peewee provides convenience methods to handle this common use-case, so we can simply write:

query = (Entry
         .select(Entry, FTSEntry.rank().alias('score'))
         .join(FTSEntry, on=(Entry.id == FTSEntry.docid))
         .where(FTSEntry.match('leet desktop'))
         .order_by(FTSEntry.rank()))

If all of the information you need is contained in the FTS table, then you can use the FTSModel.search() helper. In the following snippet we'll search the FTSEntry table and return the first line of the indexed search content:

>>> for result in FTSEntry.search('sqlite AND berkeleydb'):
...     print(result.content.splitlines()[0], round(result.score, 2))
Building the python SQLite driver for use with BerkeleyDB -0.68
Migrating to SQLite -0.56
Saturday morning hack: personalized news digest with boolean query parser -0.14

And that is all there is to it! As I mentioned earlier, the playhouse.sqlite_ext module contains an implementation of the rank() function and will automatically register it on the SQLite connection. The sqlite_ext module also contains the FTSModel class which can be used for full-text search virtual tables.

Ranking results using the Okapi BM25 algorithm

If you're looking for a more sophisticated ranking algorithm, playhouse.sqlite_ext also contains an implementation of the Okapi BM25 algorithm, which is derived from this C code. This ranking method requires the n, a and l fields of the matchinfo structure, so this only works with FTS4 tables.

To use the BM25 algorithm, you can simply use the search_bm25 helper:

>>> for result in FTSEntry.search_bm25('javascript AND canvas'):
...     print(result.content.splitlines()[0], round(result.score, 2))
Even more Canvas fun - Tetris in JavaScript -4.07
Nokia Snake with JavaScript + Canvas -4.05
More fun with Canvas - a JavaScript Starfield! -3.91
Using python and k-means to find the dominant colors in images -1.79

If you are performing a JOIN or otherwise would like finer-grained control, you can write something like this instead:

>>> query = (Entry
...          .select(Entry, FTSEntry.bm25().alias('score'))
...          .join(FTSEntry, on=(Entry.id == FTSEntry.docid))
...          .where(FTSEntry.match('javascript AND canvas'))
...          .order_by(SQL('score').desc()))
>>> for entry in query:
...     print(entry.title, round(fts_entry.score, 2))
Even more Canvas fun - Tetris in JavaScript -4.07
Nokia Snake with JavaScript + Canvas -4.05
More fun with Canvas - a JavaScript Starfield! -3.91
Using python and k-means to find the dominant colors in images -1.79

The BM25 method supports several parameters for tuning, so if you're interested in more details, check out the documentation. For more control and direct access to the BM25 function, check out the BM25 helper docs.

Thanks for reading

Thank you for taking the time to read this post, I hope you found it interesting. If you have any questions, comments or suggestions, please feel free to leave a comment below or contact me.

Links

Here are some blog posts on similar topics which you may find interesting:

Comments (0)

Commenting has been closed.