Using the SQLite JSON1 and FTS5 Extensions with Python

Author's note: the instructions posted here are out-of-date. You can find up-to-date, detailed instructions in my more recent post, Compiling SQLite for use with Python applications.

Additionally, many of the code examples were written for a now obsolete version of Peewee. Check out the Peewee documentation for information and example code for FTS5 and JSON extensions.


Back in September, word started getting around trendy programming circles about a new file that had appeared in the SQLite fossil repo named json1.c. I originally wrote up a post that contained some gross hacks in order to get pysqlite to compile and work with the new json1 extension. With the release of SQLite 3.9.0, those hacks are no longer necessary.

SQLite 3.9.0 is a fantastic release. In addition to the much anticipated json1 extension, there is a new version of the full-text search extension called fts5. fts5 improves performance of complex search queries and provides an out-of-the-box BM25 ranking implementation. You can also boost the significance of particular fields in the ranking. I suggest you check out the release notes for the full list of enhancements.

This post will describe how to compile SQLite with support for json1 and fts5. We'll use the new SQLite library to compile a python driver so we can use the new features from python. Because I really like pysqlite and apsw, I've included instructions for building both of them. Finally, we'll use peewee ORM to run queries using the json1 and fts5 extensions.

Getting started

We'll be grabbing the latest SQLite source code. To do this you can use fossil, the source-code management system used by SQLite, or alternatively you can pull down a compressed image. SQLite also requires tcl and awk to create the source amalgamation, so before starting in, you'll need to install:

There were a couple steps involved, so I'll try and break it down into sub-steps. To get started, we need a nice clean home for the new libraries. I put mine in ~/bin/jqlite for fun, but you can change the path to whatever you like.

export JQLITE="$HOME/bin/jqlite"
mkdir -p $JQLITE
cd $JQLITE

To use fossil to obtain the code, run the following commands:

fossil clone http://www.sqlite.org/cgi/src sqlite.fossil
fossil open sqlite.fossil

To fetch the snapshot tarball, run the following commands:

curl 'https://www.sqlite.org/src/tarball/sqlite.tar.gz?ci=trunk' | tar xz
mv sqlite/* .

If you prefer to use an official release, you can download one of the autoconf tarballs from the SQLite downloads page. Extract the contents into the $JQLITE directory.

Compiling SQLite with json1 and fts5

After downloading the code, you should now be in a directory alongside the SQLite source tree. SQLite supports a ton of compile configuration options. Besides json1 and fts5, I've included a number of other options I find useful.

Compilation follows the typical configure -> make -> make install sequence:

export CFLAGS="-DSQLITE_ENABLE_COLUMN_METADATA \
-DSQLITE_ENABLE_DBSTAT_VTAB \
-DSQLITE_ENABLE_FTS3 \
-DSQLITE_ENABLE_FTS3_PARENTHESIS \
-DSQLITE_ENABLE_FTS4 \
-DSQLITE_ENABLE_FTS5 \
-DSQLITE_ENABLE_JSON1 \
-DSQLITE_ENABLE_STAT4 \
-DSQLITE_ENABLE_UPDATE_DELETE_LIMIT \
-DSQLITE_SECURE_DELETE \
-DSQLITE_SOUNDEX \
-DSQLITE_TEMP_STORE=3 \
-O2 \
-fPIC"
LIBS="-lm" ./configure --prefix=$JQLITE --enable-static --enable-shared
make
make install

There should now be a file named lib/libsqlite3.a in the SQLite3 source checkout. If not, cruise the console output and hopefully the error is easy to spot. I've done this now on arch and ubuntu, but I'm not sure about fapple or windoze.

Building pysqlite

pysqlite should be familiar to most python developers, as it is more or less the same as the sqlite3 module in python's standard library. To build pysqlite against our new libsqlite3, really the only thing we need to do is modify the setup.cfg file to point it at the include and lib directories we just created.

git clone https://github.com/ghaering/pysqlite
cd pysqlite/
cp ../sqlite3.c .
echo -e "library_dirs=$JQLITE/lib" >> setup.cfg
echo -e "include_dirs=$JQLITE/include" >> setup.cfg
LIBS="-lm" python setup.py build_static

To test the install, you can cd into the build/lib.linux-xfoobar/ directory, open a python interpreter, and run the following:

>>> from pysqlite2 import dbapi2 as sqlite
>>> conn = sqlite.connect(':memory:')
>>> conn.execute('CREATE VIRTUAL TABLE testing USING fts5(data);')
<pysqlite2.dbapi2.Cursor object at 0x7ff7d0a2dc60>
>>> conn.execute('SELECT json(?)', (1337,)).fetchone()
(u'1337',)

Phew. Now it's your choice, you can run python setup.py install, or symlink the newly-built pysqlite2 (look in build/lib.linux.../) into your $PYTHONPATH. If you have a virtualenv you want to use the new pysqlite with, you can activate the virtualenv, then switch back to the pysqlite directory and run setup.py install.

Building apsw

Building apsw is almost identical to building pysqlite.

cd $JQLITE
git clone https://github.com/rogerbinns/apsw
cd apsw
cp ../sqlite3{ext.h,.h,.c} .
echo -e "library_dirs=$JQLITE/lib" >> setup.cfg
echo -e "include_dirs=$JQLITE/include" >> setup.cfg
LIBS="-lm" python setup.py build

To test the new apsw library, change directories into build/libXXX, open a python interpreter, and run the following:

>>> import apsw
>>> conn = apsw.Connection(':memory:')
>>> cursor = conn.cursor()
>>> cursor.execute('CREATE VIRTUAL TABLE testing USING fts5(data);')
<apsw.Cursor at 0x7fcf6b17fa80>
>>> cursor.execute('SELECT json(?)', (1337,)).fetchone()
(u'1337',)

You can run python setup.py install to install the new apsw system-wide, or symlink the apsw.so library (look in build/lib.linux.../) into your $PYTHONPATH. If you have a virtualenv you want to use apsw with, you can activate the virtualenv, then switch back to the apsw directory and run setup.py install.

Using the JSON extension

There are some really neat features in the json1 extension, particularly the json_tree and json_each functions/virtual tables (documented here). To demonstrate the new features, we'll use peewee, a small Python ORM, to write some JSON data then query it.

I originally thought to get test data from GitHub's API, but in order to show off the features with minimum verbosity, I've instead contrived a little JSON file which can be viewed here. It is structured like so:

[{
   "title": "My List of Python and SQLite Resources",
   "url": "http://charlesleifer.com/blog/my-list-of-python-and-sqlite-resources/",
   "metadata": {"tags": ["python", "sqlite"]}
 },
 {
   "title": "Using SQLite4's LSM Storage Engine as a Stand-alone NoSQL Database with Python"
   "url": "http://charlesleifer.com/blog/using-sqlite4-s-lsm-storage-engine-as-a-stand-alone-nosql-database-with-python/",
   "metadata": {"tags": ["nosql", "python", "sqlite", "cython"]}
  },
  ...
]

If you prefer to view the code in IPython notebook format, you can view the notebook here.

Populating the database

We'll start by fetching the JSON data file and decoding it:

>>> import json, urllib2
>>> fh = urllib2.urlopen('http://media.charlesleifer.com/blog/downloads/misc/blogs.json')
>>> data = json.loads(fh.read())
>>> data[0]
{u'metadata': {u'tags': [u'python', u'sqlite']},
 u'title': u'My List of Python and SQLite Resources',
 u'url': u'http://charlesleifer.com/blog/my-list-of-python-and-sqlite-resources/'}

Now we need to tell peewee how to access our database, so we'll wrap the SQLite database to use our custom pysqlite. I'm importing pysqlite2, which is the module we just compiled, but I'm aliasing it to jqlite just so there's no confusion. After the database class has been defined, we'll create an in-memory database. (note: in the next release, 2.6.5, peewee will use pysqlite2 automatically if it is compiled against a newer version than sqlite3).

>>> from pysqlite2 import dbapi2 as jqlite
>>> from peewee import *
>>> from playhouse.sqlite_ext import *
>>> class JQLiteDatabase(SqliteExtDatabase):
...     def _connect(self, database, **kwargs):
...         conn = jqlite.connect(database, **kwargs)
...         conn.isolation_level = None
...         self._add_conn_hooks(conn)
...         return conn
...
>>> db = JQLiteDatabase(':memory:')

Populating the database with our JSON data is pretty straightforward. We'll create a generic table with a single TEXT field. At present, SQLite does not expose a separate column/data-type for JSON data, so we will use TextField:

>>> class Entry(Model):
...     data = TextField()
...     class Meta:
...         database = db
...
>>> Entry.create_table()
>>> with db.atomic():
...     for entry_json in data:
...         Entry.create(data=json.dumps(entry_json))
...

JSON functions

We'll start by looking at json_extract(). It takes a dotted / bracketed path describing the element to find (unlike postgres which always uses []). Each Entry in the database contains a single data column which contains a JSON object. This JSON object has a title, url and metadata key at the top-level. Let's see how to extract the titles of our entries:

>>> title = fn.json_extract(Entry.data, '$.title')
>>> query = (Entry
...          .select(title.alias('title'))
...          .order_by(title)
...          .limit(5))
...
>>> [row for row in query.dicts()]
[{'title': u'A Tour of Tagging Schemas: Many-to-many, Bitmaps and More'},
 {'title': u'Alternative Redis-Like Databases with Python'},
 {'title': u'Building the SQLite FTS5 Search Extension'},
 {'title': u'Connor Thomas Leifer'},
 {'title': u'Extending SQLite with Python'}]

The query we created corresponds to the following SQL:

SELECT json_extract("t1"."data", '$.title') AS title
FROM "entry" AS t1
ORDER BY json_extract("t1"."data", '$.title')
LIMIT 5

In the next example we will extract the titles of entries that contain the given tag. To search the tags list, we will use a new function json_each(). This function behaves like a table (and in fact refers to a virtual table), and returns the direct children of the specified JSON path. Here is how to find the titles of entries tagged with "Sqlite":

>>> from peewee import Entity
>>> tags_src = fn.json_each(Entry.data, '$.metadata.tags').alias('tags')
>>> tags_ref = Entity('tags')

>>> query = (Entry
...          .select(title.alias('title'))
...          .from_(Entry, tags_src)
...          .where(tags_ref.value == 'sqlite')
...          .order_by(title))
...
>>> [row for row, in query.tuples()]
[u'Building the SQLite FTS5 Search Extension',
 u'Extending SQLite with Python',
 u'Meet Scout, a Search Server Powered by SQLite',
 u'My List of Python and SQLite Resources',
 u'Querying Tree Structures in SQLite using Python and the Transitive Closure Extension',
 u"Using SQLite4's LSM Storage Engine as a Stand-alone NoSQL Database with Python",
 u'Web-based SQLite Database Browser, powered by Flask and Peewee']

The SQL for the above query may help elucidate what's going on:

SELECT json_extract("t1"."data", '$.title') AS title
FROM
    "entry" AS t1,
    json_each("t1"."data", '$.metadata.tags') AS tags
WHERE ("tags"."value" = 'sqlite')
ORDER BY json_extract("t1"."data", '$.title')

As queries grow more complex, being able to encapsulate pieces of the query with Peewee objects becomes more useful and can facilitate code reuse.

Here is another example of json_each(). This time we will select the title of each entry, and build a comma-separated string of their associated tags. We will re-use the tags_src and tags_ref definitions created above.

>>> query = (Entry
...          .select(
...              title.alias('title'),
...              fn.group_concat(tags_ref.value, ', ').alias('tags'))
...          .from_(Entry, tags_src)
...          .group_by(title)
...          .limit(5))
...
>>> [row for row in query.tuples()]
[(u'A Tour of Tagging Schemas: Many-to-many, Bitmaps and More',
  u'peewee, sql, python'),
 (u'Alternative Redis-Like Databases with Python',
  u'python, walrus, redis, nosql'),
 (u'Building the SQLite FTS5 Search Extension',
  u'sqlite, search, python, peewee'),
 (u'Connor Thomas Leifer', u'thoughts'),
 (u'Extending SQLite with Python', u'peewee, python, sqlite')]

Again, for clarity, here is the corresponding SQL query:

SELECT
    json_extract("t1"."data", '$.title') AS title,
    group_concat("tags"."value", ', ') AS tags
FROM
    "entry" AS t1,
    json_each("t1"."data", '$.metadata.tags') AS tags
GROUP BY json_extract("t1"."data", '$.title')
LIMIT 5

The last function I'll cover is json_tree(). Like json_each(), json_tree() is a multi-valued function that behaves like a table. Where json_each() only returns the direct descendants of the specified path, json_tree() will recursively traverse the object, returning all children.

If the tags key was nested somewhere arbitrarily deep for each entry, here is how we might find entries that matched a given tag:

>>> tree = fn.json_tree(Entry.data, '$').alias('tree')
>>> parent = fn.json_tree(Entry.data, '$').alias('parent')

>>> tree_ref = Entity('tree')
>>> parent_ref = Entity('parent')

>>> query = (Entry
...          .select(title.alias('title'))
...          .from_(Entry, tree, parent)
...          .where(
...              (tree_ref.parent == parent_ref.id) &
...              (parent_ref.key == 'tags') &
...              (tree_ref.value == 'sqlite'))
...          .order_by(title))
...
>>> [title for title, in query.tuples()]
[u'Building the SQLite FTS5 Search Extension',
 u'Extending SQLite with Python',
 u'Meet Scout, a Search Server Powered by SQLite',
 u'My List of Python and SQLite Resources',
 u'Querying Tree Structures in SQLite using Python and the Transitive Closure Extension',
 u"Using SQLite4's LSM Storage Engine as a Stand-alone NoSQL Database with Python",
 u'Web-based SQLite Database Browser, powered by Flask and Peewee']

What's going on in the above code is we are selecting the Entry itself, along with two trees representing the entry's child nodes. Since each tree node contains a reference to it's parent, we can simply search for a parent node named "tags" that contains a child node with the value "sqlite".

Here it is in SQL:

SELECT json_extract("t1"."data", '$.title') AS title
FROM
    "entry" AS t1,
    json_tree("t1"."data", '$') AS tree,
    json_tree("t1"."data", '$') AS parent
WHERE (
    ("tree"."parent" = "parent"."id") AND
    ("parent"."key" = 'tags') AND
    ("tree"."value" = 'sqlite'))
ORDER BY json_extract("t1"."data", '$.title')

This is just a sampling of the functionality in the json1 extension, and I'm looking forward to experimenting with it more over the coming weeks. Please feel free to leave a comment here, or if you have specific questions about the extension, you might wish to refer to the sqlite-users mailing list.

FTS5 with Python

The code in this section is a continuation of the previous JSON example code, as we'll be using the titles from the Entry data file and using them to populate a search index. The FTS5Model feature of peewee will be included in the 2.6.5 release, and is currently available in the master branch on github.

Picking up from where we left off in the JSON examples, let's create another table that will be our search index for Entry data.

The fts5 extension requires that all columns be completely devoid of any types or constraints. The only additional information we can provide is to indicate a column is unindexed, meaning that the data is stored but not searchable.

Let's define a search index for the entry model that allows us to search over the titles and determine the associated URL. To do this we'll leave the url field as unindexed.

class EntryIndex(FTS5Model):
    title = SearchField()
    url = SearchField(unindexed=True)

    class Meta:
        database = db
        options = {
            'prefix': [2, 3],
            'tokenize': 'porter unicode61', 
        }

EntryIndex.create_table()

The options dictionary provides additional metadata to the fts5 extension telling it how to tokenize words, and what lengths of prefixes to store for fast prefix search. The SQL for the CREATE TABLE looks like this:

CREATE VIRTUAL TABLE "entryindex" USING fts5 (
    "title" ,
    "url"  UNINDEXED,
    prefix='2,3',
    tokenize="porter unicode61")

To populate our index, we'll use a couple JSON functions to copy data over from the Entry model:

title = fn.json_extract(Entry.data, '$.title').alias('title')
url = fn.json_extract(Entry.data, '$.url').alias('url')
query = Entry.select(title, url).dicts()
with db.atomic():
    for entry in query:
        EntryIndex.create(**entry)

Now that our index is populated, let's perform a few searches:

>>> query = EntryIndex.search('sqlite').limit(3)
>>> for result in query:
...     print result.title

Extending SQLite with Python
Building the SQLite FTS5 Search Extension
My List of Python and SQLite Resources

The SQL for the above query looks like this:

SELECT "t1"."title", "t1"."url"
FROM "entryindex" AS t1
WHERE ("entryindex" MATCH 'sqlite')
ORDER BY rank

We can retrieve the score of our results as well:

>>> query = EntryIndex.search('sqlite AND python', with_score=True)
>>> for result in query:
...     print round(result.score, 3), result.title

-1.259 Extending SQLite with Python
-1.059 My List of Python and SQLite Resources
-0.838 Querying Tree Structures in SQLite using Python and the Transitive Closure Extension

Notice how sensible those results are? The SQL for the above query looks like:

SELECT "t1"."title", "t1"."url", rank AS score
FROM "entryindex" AS t1
WHERE ("entryindex" MATCH 'sqlite AND python')
ORDER BY rank

I've barely scratched the surface of the functionality provided by the fts5 extension, but if you check out the documentation you'll find a lot of great features. Here are some examples:

Thanks for reading

I hope you enjoyed reading this post. I think the addition of a JSON extension to SQLite is very exciting for both the project and for its users. Postgresql and MySQL both have their own JSON implementations, so it's good to see SQLite is not lagging behind. Having JSON also may obviate the need, in some cases, to reach for dedicated embedded document stores like UnQLite.

It's also worth noting that the file is named json1.c. This is an effort, on Dr. Hipp's part, to show that this is just the first step, and that there will be more in the future. So whatever issues there may be with the current implementation, I'm sure future releases will continue to improve both the performance and APIs. For one, I believe he is looking into a binary format that would be more performant.

I'm also excited to see SQLite continuing to improve and innovate on the quality of it's full-text search extension. Providing a built-in ranking algorithm and an API for users to implement their own is a much-needed addition.

Links of interest:

Other blog posts you may like:

And, just to put this out there ahead of time, here are some responses to the news of SQLite's JSON module. Don't be one of these people!

Comments (3)

Charles | nov 18 2015, at 11:50am

Nice post Oisin, thanks for sharing!

Oisin Mulvihill | nov 18 2015, at 11:46am

Thanks for this excellent article.

I was wondering if I could cut out the source compile steps on OSX with homebrew. Compiling from source is better and gets the latest versions however homebrew saves a lot of time on OSX just to play around. It turned out I could and I was able to get sqlite3, json & ft5 extensions working with a pip installed pysqlite. I show how here http://www.sourceweaver.com/post/133474396355/homebrew-install-of-sqlite-json-extension

Anonymous | sep 17 2015, at 05:00pm

fapple

I am going to love you for this more than I already love you for peewee.


Commenting has been closed.