May 12, 2014 19:12 / 0 comments / peewee python search sqlite

Full-text search with SQLite

In this post I will show how to use SQLite full-text search with Python (and a lot of help from peewee ORM). We will see how to index content for searching, how to perform searches, and how to order search results using two ranking algorithms.

Last week I migrated my site from Postgresql to SQLite. I had been using Redis to power my site's search, but since SQLite has an awesome full-text search extension, I decided to give it a try. I am really pleased with the results, and being able to specify boolean search queries is an added plus. Here is a brief overview of the types of search queries SQLite supports:

If you'd like to give it a try, here are a couple example searches on my blog:

Getting started

In order to use this cool SQLite feature, we need to use the playhouse.sqlite_ext.SqliteExtDatabase class for our application's database:

from playhouse.sqlite_ext import SqliteExtDatabase

db = SqliteExtDatabase('blog.db', threadlocals=True)

This database subclass comes with special support for SQLite extensions and virtual tables, which we will need to index content for searching. If you're curious, the threadlocals=True is important for multi-threaded (or green-threaded) servers, which, if you're using a production WSGI server, is probably what you have.

A note on virtual tables

In order to use full-text searches with SQLite, you need to create a special virtual table in the database for the content you wish to index. Rows are created in this table just like any other, but behind-the-scenes the FTS extension will create some indexes that will enable fast searches. Since we'll be using peewee ORM, we need to create a Model to represent the search content. The peewee SQLite helpers will allow us to create the model with the appropriate table definition. To create a virtual table for the FTS extension, simply subclass playhouse.sqlite_ext.FTSModel instead of the usual Model class.

Creating the Models

With our database configured, we'll create a really basic blog entry model that has a title and a content field. In reality you may have a number of other fields like status, timestamp, etc, but since the other columns have no effect on the rest of the implementation, I will skip them. Here is the model definition:

from peewee import *
from playhouse.sqlite_ext import *

db = SqliteExtDatabase('blog.db', threadlocals=True)

class Entry(Model):
    title = CharField()
    content = TextField()
    # You may have any number of other fields, such as status,
    # timestamp, etc.

    class Meta:
        database = db

We now need to create a special model for storing and indexing the entry content. Since each row in the index table will correspond exactly to one blog entry, we can specify the entry as the primary key. If you have multiple columns you wish to search, such as a title and a content field, you could add them to the FTSEntry model or just concatenate the data together and dump it into a single text field. For simplicity we'll do the latter and store everything in a single content field:

class FTSEntry(FTSModel):
    entry = ForeignKeyField(Entry, primary_key=True)
    content = TextField()

    class Meta:
        database = db

You may wonder why not just have all your searchable models be subclasses of FTSModel. I would advise against doing this because of some limitations to SQLite's virtual tables. Virtual tables do not support indexes, which can really hurt performance. Additionally the full-text virtual tables treat all columns as having the TEXT data-type, regardless of the actual column definition. So instead of using FTSModel for everything, typically you'll keep your data in a normal table, and then store the search content in a related virtual table.

Open up a python shell and create the tables:

# Create the tables.
>>> from search_example import Entry, FTSEntry
>>> Entry.create_table()
>>> FTSEntry.create_table()

After creating the tables from the python shell, we can introspect the schema and see that SQLite has created the FTSEntry table as well as several other tables to store search metadata:

sqlite> .schema
CREATE TABLE "entry" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "title" VARCHAR(255) NOT NULL,
    "content" TEXT NOT NULL);
CREATE VIRTUAL TABLE "ftsentry" USING FTS4 (
    "entry_id" INTEGER NOT NULL PRIMARY KEY,
    "content" TEXT NOT NULL,
    FOREIGN KEY ("entry_id") REFERENCES "entry" ("id"));
CREATE TABLE 'ftsentry_content'(...)
CREATE TABLE 'ftsentry_segments'(...)
CREATE TABLE 'ftsentry_segdir'(...)
CREATE TABLE 'ftsentry_docsize'(...)
CREATE TABLE 'ftsentry_stat'(...)

For this example I'm going to use the actual blog entries from site, but if you're following along and want to create some entries, you might write some code like this:

  entry = Entry.create(
      title='How I rewrote everything with golang',
      content='Blah blah blah, type system, channels, blurgh')
  FTSEntry.create(
      entry=entry,
      content='\n'.join((entry.title, entry.content)))

  entry = Entry.create(
      title='Why ORMs are a terrible idea',
      content='Blah blah blah, leaky abstraction, impedance mismatch')
  FTSEntry.create(
      entry=entry,
      content='\n'.join((entry.title, entry.content)))

As you can see, indexing content is as simple as creating a new row in the FTSEntry table. Coming up with good content is another story...

Performing searches

SQLite uses the MATCH operator to express a full-text search query, e.g.

sqlite> SELECT entry.title
   ...> FROM ftsentry
   ...> JOIN entry ON ftsentry.entry_id = entry.id
   ...> WHERE ftsentry MATCH 'django AND peewee AND flask';
Structuring flask apps, a how-to for those coming from Django
Peewee was baroque, so I rewrote it
The missing library: ad-hoc queries for your models
Using Redis Pub/Sub and IRC for Error Logging with Python
Don't sweat the small stuff - use flask blueprints
Integrating the flask microframework with the peewee ORM
Peewee, a lightweight Python ORM - Original Post

You can still use equality or LIKE comparisons with the FTSEntry table, but these will be slow because they involve a full table scan. The MATCH operator, on the other hand, makes use of the special full-text search indexes and is very performant. From peewee, we can perform simple search queries by using the FTSModel.match helper:

query = (FTSEntry
         .select(Entry.title)
         .join(Entry)
         .where(FTSEntry.match('javascript AND canvas'))
         .dicts())
for row_dict in query:
    print row_dict

# Prints...
{'title': u'Using python and k-means to find the dominant colors in images'}
{'title': u'Even more Canvas fun - Tetris in JavaScript'}
{'title': u'More fun with Canvas - a JavaScript Starfield!'}
{'title': u'Nokia Snake with JavaScript + Canvas'}

Suppose that we wanted to perform a search and retrieve the populated Entry object along with the results without incurring O(n) queries. In the previous code snippet we're doing a JOIN to get the content, but peewee will also allow us to get the entire Entry instance:

query = (FTSEntry
         .select(Entry, FTSEntry)
         .join(Entry)
         .where(FTSEntry.match('javascript AND canvas')))
for fts_entry in query:
    print fts_entry.entry.title

# Prints:
Using python and k-means to find the dominant colors in images
Even more Canvas fun - Tetris in JavaScript
More fun with Canvas - a JavaScript Starfield!
Nokia Snake with JavaScript + Canvas

Sorting by best match

By default, the results returned from a MATCH query are in an unspecified order. For full-text search to be useful, though, the results should be ordered by relevance. SQLite's FTS extension does not come with a relevance function per-se, but it is possible to extract metadata from the full-text index and get pretty good results. The SQLite documentation on FTS contains a brief appendix which describes one method I will demonstrate below. The idea is that we will use the SQLite matchinfo function, which returns information about term counts, to rank the result rows.

The blob returned by matchinfo() contains several fields:

If you are using FTS4, then there are several additional fields available:

Our first ranking function will calculate ∑(x1 / x2) for each phrase/column pair in a given row. We will implement this function in Python, then register it with the SQLite connection as a user-defined function.

Note

The peewee sqlite_ext module defines and automatically registers this function for you. This code is presented by way of explanation in case you're curious how things work.

The matchinfo() function returns a blob of 32-bit unsigned ints for each value, so we will need to convert this data to a Python list of int:

def _parse_match_info(buf):
    bufsize = len(buf)  # Length in bytes.
    return [struct.unpack('@I', buf[i:i+4])[0] for i in range(0, bufsize, 4)]

The length of the matchinfo data varies depending on the number of phrases in the search query, and the number of columns on the table being queried. Our rank function will determine the number of phrases and columns, then calculate a sum of counts, which we will treat as the rank of the document:

def rank(raw_match_info):
    # handle match_info called w/default args 'pcx' - based on the example rank
    # function http://sqlite.org/fts3.html#appendix_a
    match_info = _parse_match_info(raw_match_info)
    score = 0.0
    p, c = match_info[:2]
    for phrase_num in range(p):
        phrase_info_idx = 2 + (phrase_num * c * 3)
        for col_num in range(c):
            col_idx = phrase_info_idx + (col_num * 3)
            x1, x2 = match_info[col_idx:col_idx + 2]
            if x1 > 0:
                score += float(x1) / x2
    return score

Now that we have our ranking function ready, we can start to use it in our SQLite queries. The peewee SQLite helper module automatically handles registering this function for you, so it will just work.

Here is an example of the SQL we will now generate to order search results by relevance:

SELECT entry.title, rank(matchinfo(ftsentry)) AS score
FROM ftsentry
JOIN entry ON ftsentry.entry_id = entry.id
WHERE ftsentry MATCH '1337 desktop'
ORDER BY score DESC;

-- returns --
Ricing the Desktop: "Brown rice", 0.273
In Honor of Spring..., 0.152
leet, 0.148
Using SQLite Full-Text Search with Python, 0.090

The first result returned contains the best match, and has two occurrences of the term 1337, and four for desktop. The second result contains the term desktop three times. You can see that the ranking algorithm gave each row a score corresponding to the number of times the search term occurs.

Peewee provides convenience methods to handle this common use-case, so we can simply write:

query = (FTSEntry
         .select(Entry.content, FTSEntry.rank().alias('score'))
         .join(Entry)
         .where(FTSEntry.match('1337 desktop'))
         .order_by(SQL('score').desc()))

If all of the information you need is contained in the FTS table, then you can use the FTSModel.search() helper. In the following snippet we'll search the FTSEntry table and return the first line of the indexed search content:

>>> for result in FTSEntry.search('sqlite AND berkeleydb'):
...     print result.content.splitlines()[0], round(result.score, 2)
Building the python SQLite driver for use with BerkeleyDB 0.68
Migrating to SQLite 0.56
Saturday morning hack: personalized news digest with boolean query parser 0.14

And that is all there is to it! As I mentioned earlier, the playhouse.sqlite_ext module contains an implementation of the rank() function and will automatically register it on the SQLite connection. The sqlite_ext module also contains the FTSModel class which can be used for full-text search virtual tables.

Ranking results using the Okapi BM25 algorithm

If you're looking for a more sophisticated ranking algorithm, playhouse.sqlite_ext also contains an implementation of the Okapi BM25 algorithm, which is derived from this C code. This ranking method requires the n, a and l fields of the matchinfo structure, so this only works with FTS4 tables.

To use the BM25 algorithm, you can simply use the search_bm25 helper:

>>> for result in FTSEntry.search_bm25('javascript AND canvas'):
...     print result.content.splitlines()[0], round(result.score, 2)
Even more Canvas fun - Tetris in JavaScript 4.07
Nokia Snake with JavaScript + Canvas 4.05
More fun with Canvas - a JavaScript Starfield! 3.91
Using python and k-means to find the dominant colors in images 1.79

If you are performing a JOIN or otherwise would like finer-grained control, you can write something like this instead:

>>> query = (FTSEntry
...          .select(
...              Entry,
...              FTSEntry,
...              FTSEntry.bm25(FTSEntry.content).alias('score'))
...          .join(Entry)
...          .where(FTSEntry.match('javascript AND canvas'))
...          .order_by(SQL('score').desc()))
>>> for fts_entry in query:
...     print fts_entry.entry.title, round(fts_entry.score, 2)
Even more Canvas fun - Tetris in JavaScript 4.07
Nokia Snake with JavaScript + Canvas 4.05
More fun with Canvas - a JavaScript Starfield! 3.91
Using python and k-means to find the dominant colors in images 1.79

The BM25 method supports several parameters for tuning, so if you're interested in more details, check out the documentation. For more control and direct access to the BM25 function, check out the BM25 helper docs.

Thanks for reading

Thank you for taking the time to read this post, I hope you found it interesting. If you have any questions, comments or suggestions, please feel free to leave a comment below or contact me.

Links

Comments (0)


Commenting has been closed, but please feel free to contact me