SQLite Spellfix Extension and Python

The SQLite source tree is full of wonders. There is the the lemon parser generator, a btree implementation (well, kind of not surprising), multiple search engines, a json library, and more. It looks like Dr. Hipp is also experimenting with integrating the LSM key/value store from SQLite4 as a standalone virtual table.

I've written about the json and full-text search extensions, the lsm key/value store, and the transitive closure extension (useful when querying hierarchical data). In this post I'll be covering another interesting extension, the spellfix1 extension (documentation).

The spellfix1 extension is useful for providing suggestions for misspelled words, but could also be used alongside a full-text search implementation to provide corrections for commonly misspelled words in searches. If you have a mobile app, you could use it to correct misspellings due to...mobile. It's a snap to use the extension, so let's compile it and give it a shot. I'll be using Peewee, a lightweight Python ORM, to communicate with the database.

Building the extension

There are a number of ways you can get the source code for the extension. You can download the entire source using the 3.9.2 snapshot, or alternatively just fetch the spellfix.c file:

$ wget http://sqlite.org/cgi/src/raw/ext/misc/spellfix.c?name=b9065af7ab1f2597b505a8aa9892620866d502fc -O spellfix1.c

If you downloaded the source snapshot, you can find the file in the ext/misc/ subdirectory.

We'll compile the extension into a shared library:

$ gcc -fPIC -shared spellfix1.c -o spellfix1.so

You should now have a spellfix1.so file in your current directory.

Setting up the database

The spellfix1 extension creates a virtual table with a number of columns that provide metadata about the words listed in the table. To get started, we'll create a new Python file (spelling.py) and define our database and spelling table:

from peewee import *
from playhouse.sqlite_ext import *

db = SqliteExtDatabase('spelling.db')

# Configure the database to load the `spellfix1.so` extension
# whenever a connection is opened.
db.load_extension('spellfix1')

class Spellfix(VirtualModel):
    _extension = 'spellfix1'
    word = VirtualCharField()
    rank = VirtualIntegerField()
    distance = VirtualIntegerField()
    langid = VirtualIntegerField()
    score = VirtualFloatField()
    matchlen = VirtualIntegerField()
    phonehash = VirtualCharField()
    top = VirtualIntegerField()
    scope = VirtualFloatField()
    soundslike = VirtualCharField()

    class Meta:
        database = db
        primary_key = False

# Create the table if it does not exist.
db.create_tables([Spellfix], safe=True)

When populating the spellfix table, all that is required is a value for the word column. The langid can be specified if you plan on supporting multiple languages, and if you would like to associate weights with words, you can specify a value for the rank column. In this post we'll just specify a value for the word column.

The spellfix table is now ready to be populated with a word list. If you're a linux user there is typically a word list in /usr/share/dict/words. On arch linux, I just installed the words package (sudo pacman -S words) - I imagine there's a similar process for noobuntu/fapple users.

Add the following code to populate the table:

WORD_FILE = '/usr/share/dict/words'

# Populate the table if it hasn't been already.
if not Spellfix.select().exists():
    with open(WORD_FILE) as fh:
        with db.atomic():
            for word in fh:
                Spellfix.insert({Spellfix.word: word.rstrip()}).execute()

If you run the script at this point, it should take a minute to finish, after which you'll have a SQLite database named spelling.db which has been populated with our list of words.

Getting spelling suggestions

To use the extension, we will query the Spellfix table's word column using the special SQLite MATCH operator. When we query the table, the extension will calculate the edit distance between the input word and a potential match in the database. The edit distance, combined with the rank provided when the table was created, determines the score of the match.

Here is a short snippet of code you can put at the end of the script to test the extension out:

while True:
    phrase = raw_input('Enter phrase: ')
    if not phrase or phrase == 'q':
        break
    accum = []
    for word in phrase.split():
        query = (Spellfix
                 .select(Spellfix.word, Spellfix.score)
                 .where(
                     match(Spellfix.word, word) &
                     (Spellfix.top == 5))
                 .dicts())
        print word
        for result in query:
            print '    ', result['word'], result['score']

Let's try running the script now and seeing how things work:

$ python spelling.py
Enter phrase: hello fellowe pythonn users
hello
     hello 31.0
     Eloy 45.0
     hello's 56.0
     hellos 56.0
     helot 66.0
fellowe
     fallowed 96.0
     bellowed 96.0
     followed 96.0
     follower 96.0
     Felipe 121.0
pythonn
     Python 41.0
     python 41.0
     Python's 66.0
     pythons 66.0
     python's 66.0
users
     users 31.0
     user's 32.0
     ushers 32.0
     usher's 33.0
     USSR's 62.0
     USSR's 62.0

Enter phrase: foo bar baz baze
foo
     foo 31.0
     foe 44.0
     food 56.0
     foot 56.0
     Foch 66.0
bar
     bar 31.0
     Barr 33.0
     bare 36.0
     barre 38.0
     Barry 38.0
baz
     Baez 46.0
     bash 71.0
     bag 71.0
     biz 71.0
     Bach 71.0
baze
     baize 46.0
     bake 71.0
     base 71.0
     baize's 71.0
     faze 71.0

Pretty neat! If you'd like to learn more about how to use the spellfix1 extension, including using the "sounds like" feature and integrating with full-text search, check out the official extension documentation.

Thanks for reading, I hope you found this post interesting! If you'd like to check out some similar posts, check out the sqlite tag on my blog as it's one of my favorite topics lately!

Comments (1)

Tom A | dec 06 2015, at 05:32am

Many thanks for this and all the other excellent Flask / Peewee / SQLite posts Charles.

I've learned so much from your blog - do keep the posts coming :-)


Commenting has been closed, but please feel free to contact me