Using the SQLite JSON1 and FTS5 Extensions with Python
Author's note: the instructions posted here are out-of-date. You can find up-to-date, detailed instructions in my more recent post, Compiling SQLite for use with Python applications.
Additionally, many of the code examples were written for a now obsolete version of Peewee. Check out the Peewee documentation for information and example code for FTS5 and JSON extensions.
Back in September, word started getting around trendy programming circles about a new file that had appeared in the SQLite fossil repo named json1.c. I originally wrote up a post that contained some gross hacks in order to get pysqlite to compile and work with the new json1 extension. With the release of SQLite 3.9.0, those hacks are no longer necessary.
SQLite 3.9.0 is a fantastic release. In addition to the much-anticipated json1 extension, there is a new version of the full-text search extension called fts5. fts5 improves performance of complex search queries and provides an out-of-the-box BM25 ranking implementation. You can also boost the significance of particular fields in the ranking. I suggest you check out the release notes for the full list of enhancements.
This post will describe how to compile SQLite with support for json1 and fts5. We'll use the new SQLite library to compile a Python driver so we can use the new features from Python. Because I really like pysqlite and apsw, I've included instructions for building both of them. Finally, we'll use the peewee ORM to run queries using the json1 and fts5 extensions.
Getting started
We'll be grabbing the latest SQLite source code. To do this you can use fossil, the source-code management system used by SQLite, or alternatively you can pull down a compressed image. SQLite also requires tcl and awk to create the source amalgamation, so before starting, you'll need to install:
- tcl
- awk (available on most unix systems)
- fossil (optional)
There are a couple of steps involved, so I'll try to break it down into sub-steps. To get started, we need a nice clean home for the new libraries. I put mine in ~/bin/jqlite for fun, but you can change the path to whatever you like.
export JQLITE="$HOME/bin/jqlite"
mkdir -p $JQLITE
cd $JQLITE
To use fossil to obtain the code, run the following commands:
fossil clone http://www.sqlite.org/cgi/src sqlite.fossil
fossil open sqlite.fossil
To fetch the snapshot tarball, run the following commands:
curl 'https://www.sqlite.org/src/tarball/sqlite.tar.gz?ci=trunk' | tar xz
mv sqlite/* .
If you prefer to use an official release, you can download one of the autoconf tarballs from the SQLite downloads page. Extract the contents into the $JQLITE directory.
Compiling SQLite with json1 and fts5
After downloading the code, you should now be in a directory alongside the SQLite source tree. SQLite supports a ton of compile-time configuration options; besides json1 and fts5, I've included a number of other options I find useful. Compilation follows the typical configure -> make -> make install sequence:
export CFLAGS="-DSQLITE_ENABLE_COLUMN_METADATA \
-DSQLITE_ENABLE_DBSTAT_VTAB \
-DSQLITE_ENABLE_FTS3 \
-DSQLITE_ENABLE_FTS3_PARENTHESIS \
-DSQLITE_ENABLE_FTS4 \
-DSQLITE_ENABLE_FTS5 \
-DSQLITE_ENABLE_JSON1 \
-DSQLITE_ENABLE_STAT4 \
-DSQLITE_ENABLE_UPDATE_DELETE_LIMIT \
-DSQLITE_SECURE_DELETE \
-DSQLITE_SOUNDEX \
-DSQLITE_TEMP_STORE=3 \
-O2 \
-fPIC"
LIBS="-lm" ./configure --prefix=$JQLITE --enable-static --enable-shared
make
make install
There should now be a file named lib/libsqlite3.a in the SQLite source checkout. If not, cruise the console output and hopefully the error will be easy to spot. I've done this on Arch and Ubuntu, but I'm not sure about fapple or windoze.
Building pysqlite
pysqlite should be familiar to most Python developers, as it is more or less the same as the sqlite3 module in Python's standard library. To build pysqlite against our new libsqlite3, really the only thing we need to do is modify the setup.cfg file to point it at the include and lib directories we just created.
git clone https://github.com/ghaering/pysqlite
cd pysqlite/
cp ../sqlite3.c .
echo -e "library_dirs=$JQLITE/lib" >> setup.cfg
echo -e "include_dirs=$JQLITE/include" >> setup.cfg
LIBS="-lm" python setup.py build_static
To test the install, you can cd into the build/lib.linux-xfoobar/ directory, open a Python interpreter, and run the following:
>>> from pysqlite2 import dbapi2 as sqlite
>>> conn = sqlite.connect(':memory:')
>>> conn.execute('CREATE VIRTUAL TABLE testing USING fts5(data);')
<pysqlite2.dbapi2.Cursor object at 0x7ff7d0a2dc60>
>>> conn.execute('SELECT json(?)', (1337,)).fetchone()
(u'1337',)
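If you'd rather sanity-check a build without opening an interactive session, the same probes can be scripted. Here is a minimal sketch using the standard-library sqlite3 module instead of pysqlite2 (the function name is my own; it simply attempts a json() call and a throwaway fts5 table and reports what worked):

```python
import sqlite3

def extension_report(conn):
    """Probe a connection for json1 and fts5 support by trying to use each."""
    report = {}
    try:
        # json() comes from the json1 extension (built into SQLite >= 3.38).
        conn.execute('SELECT json(?)', ('1337',))
        report['json1'] = True
    except sqlite3.OperationalError:
        report['json1'] = False
    try:
        # Creating a throwaway virtual table fails if fts5 was not compiled in.
        conn.execute('CREATE VIRTUAL TABLE _fts5_probe USING fts5(data)')
        conn.execute('DROP TABLE _fts5_probe')
        report['fts5'] = True
    except sqlite3.OperationalError:
        report['fts5'] = False
    return report

conn = sqlite3.connect(':memory:')
print(extension_report(conn))
```

The same function works unchanged against a pysqlite2 connection, since both drivers expose the DB-API execute() interface.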
Phew. Now it's your choice: you can run python setup.py install, or symlink the newly-built pysqlite2 (look in build/lib.linux.../) into your $PYTHONPATH. If you have a virtualenv you want to use the new pysqlite with, you can activate the virtualenv, then switch back to the pysqlite directory and run setup.py install.
Building apsw
Building apsw is almost identical to building pysqlite.
cd $JQLITE
git clone https://github.com/rogerbinns/apsw
cd apsw
cp ../sqlite3{ext.h,.h,.c} .
echo -e "library_dirs=$JQLITE/lib" >> setup.cfg
echo -e "include_dirs=$JQLITE/include" >> setup.cfg
LIBS="-lm" python setup.py build
To test the new apsw library, change directories into build/libXXX, open a Python interpreter, and run the following:
>>> import apsw
>>> conn = apsw.Connection(':memory:')
>>> cursor = conn.cursor()
>>> cursor.execute('CREATE VIRTUAL TABLE testing USING fts5(data);')
<apsw.Cursor at 0x7fcf6b17fa80>
>>> cursor.execute('SELECT json(?)', (1337,)).fetchone()
(u'1337',)
You can run python setup.py install to install the new apsw system-wide, or symlink the apsw.so library (look in build/lib.linux.../) into your $PYTHONPATH. If you have a virtualenv you want to use apsw with, you can activate the virtualenv, then switch back to the apsw directory and run setup.py install.
Using the JSON extension
There are some really neat features in the json1 extension, particularly the json_tree and json_each functions/virtual tables (documented here). To demonstrate the new features, we'll use peewee, a small Python ORM, to write some JSON data then query it.
I originally thought to get test data from GitHub's API, but in order to show off the features with minimum verbosity, I've instead contrived a little JSON file which can be viewed here. It is structured like so:
[{
"title": "My List of Python and SQLite Resources",
"url": "http://charlesleifer.com/blog/my-list-of-python-and-sqlite-resources/",
"metadata": {"tags": ["python", "sqlite"]}
},
{
"title": "Using SQLite4's LSM Storage Engine as a Stand-alone NoSQL Database with Python",
"url": "http://charlesleifer.com/blog/using-sqlite4-s-lsm-storage-engine-as-a-stand-alone-nosql-database-with-python/",
"metadata": {"tags": ["nosql", "python", "sqlite", "cython"]}
},
...
]
If you prefer to view the code in IPython notebook format, you can view the notebook here.
Populating the database
We'll start by fetching the JSON data file and decoding it:
>>> import json, urllib2
>>> fh = urllib2.urlopen('http://media.charlesleifer.com/blog/downloads/misc/blogs.json')
>>> data = json.loads(fh.read())
>>> data[0]
{u'metadata': {u'tags': [u'python', u'sqlite']},
u'title': u'My List of Python and SQLite Resources',
u'url': u'http://charlesleifer.com/blog/my-list-of-python-and-sqlite-resources/'}
Now we need to tell peewee how to access our database, so we'll wrap the SQLite database class to use our custom pysqlite. I'm importing pysqlite2, which is the module we just compiled, but I'm aliasing it to jqlite just so there's no confusion. After the database class has been defined, we'll create an in-memory database. (Note: as of the next release, 2.6.5, peewee will use pysqlite2 automatically if it is compiled against a newer version of SQLite than the standard-library sqlite3.)
>>> from pysqlite2 import dbapi2 as jqlite
>>> from peewee import *
>>> from playhouse.sqlite_ext import *
>>> class JQLiteDatabase(SqliteExtDatabase):
... def _connect(self, database, **kwargs):
... conn = jqlite.connect(database, **kwargs)
... conn.isolation_level = None
... self._add_conn_hooks(conn)
... return conn
...
>>> db = JQLiteDatabase(':memory:')
Populating the database with our JSON data is pretty straightforward. We'll create a generic table with a single TEXT field. At present, SQLite does not expose a separate column/data-type for JSON data, so we will use TextField:
>>> class Entry(Model):
... data = TextField()
... class Meta:
... database = db
...
>>> Entry.create_table()
>>> with db.atomic():
... for entry_json in data:
... Entry.create(data=json.dumps(entry_json))
...
JSON functions
We'll start by looking at json_extract(). It takes a dotted/bracketed path describing the element to find (unlike Postgres, which always uses []). Each Entry in the database contains a single data column holding a JSON object. This JSON object has title, url and metadata keys at the top level. Let's see how to extract the titles of our entries:
>>> title = fn.json_extract(Entry.data, '$.title')
>>> query = (Entry
... .select(title.alias('title'))
... .order_by(title)
... .limit(5))
...
>>> [row for row in query.dicts()]
[{'title': u'A Tour of Tagging Schemas: Many-to-many, Bitmaps and More'},
{'title': u'Alternative Redis-Like Databases with Python'},
{'title': u'Building the SQLite FTS5 Search Extension'},
{'title': u'Connor Thomas Leifer'},
{'title': u'Extending SQLite with Python'}]
The query we created corresponds to the following SQL:
SELECT json_extract("t1"."data", '$.title') AS title
FROM "entry" AS t1
ORDER BY json_extract("t1"."data", '$.title')
LIMIT 5
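If you want to see the same query without the ORM layer, here is a sketch that runs the generated SQL directly through the standard-library sqlite3 module. The two sample rows are made up for illustration, but any build with the json1 functions should behave the same way:

```python
import json
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE entry (data TEXT)')

# Two made-up rows shaped like the blog's JSON file.
rows = [
    {'title': 'Extending SQLite with Python',
     'metadata': {'tags': ['python', 'sqlite']}},
    {'title': 'Alternative Redis-Like Databases with Python',
     'metadata': {'tags': ['python', 'redis', 'nosql']}},
]
conn.executemany('INSERT INTO entry (data) VALUES (?)',
                 [(json.dumps(row),) for row in rows])

# json_extract() pulls a scalar out of each stored JSON document.
titles = conn.execute(
    "SELECT json_extract(data, '$.title') AS title "
    "FROM entry ORDER BY title LIMIT 5").fetchall()
print(titles)
```

Sorting happens on the extracted value, so the "Alternative..." title comes back first.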
In the next example we will extract the titles of entries that contain a given tag. To search the tags list, we will use a new function, json_each(). This function behaves like a table (and in fact is a virtual table), and returns the direct children of the specified JSON path. Here is how to find the titles of entries tagged with "sqlite":
>>> from peewee import Entity
>>> tags_src = fn.json_each(Entry.data, '$.metadata.tags').alias('tags')
>>> tags_ref = Entity('tags')
>>> query = (Entry
... .select(title.alias('title'))
... .from_(Entry, tags_src)
... .where(tags_ref.value == 'sqlite')
... .order_by(title))
...
>>> [row for row, in query.tuples()]
[u'Building the SQLite FTS5 Search Extension',
u'Extending SQLite with Python',
u'Meet Scout, a Search Server Powered by SQLite',
u'My List of Python and SQLite Resources',
u'Querying Tree Structures in SQLite using Python and the Transitive Closure Extension',
u"Using SQLite4's LSM Storage Engine as a Stand-alone NoSQL Database with Python",
u'Web-based SQLite Database Browser, powered by Flask and Peewee']
The SQL for the above query may help elucidate what's going on:
SELECT json_extract("t1"."data", '$.title') AS title
FROM
"entry" AS t1,
json_each("t1"."data", '$.metadata.tags') AS tags
WHERE ("tags"."value" = 'sqlite')
ORDER BY json_extract("t1"."data", '$.title')
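The same join can be exercised outside the ORM. This sketch feeds a couple of hypothetical rows through the standard-library sqlite3 module and runs the raw SQL above:

```python
import json
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE entry (data TEXT)')
rows = [
    {'title': 'Extending SQLite with Python',
     'metadata': {'tags': ['python', 'sqlite']}},
    {'title': 'Alternative Redis-Like Databases with Python',
     'metadata': {'tags': ['python', 'redis']}},
]
conn.executemany('INSERT INTO entry (data) VALUES (?)',
                 [(json.dumps(row),) for row in rows])

# json_each() expands '$.metadata.tags' into one row per tag, so the WHERE
# clause filters entries by tag membership.
tagged = conn.execute(
    "SELECT json_extract(e.data, '$.title') AS title "
    "FROM entry AS e, json_each(e.data, '$.metadata.tags') AS tags "
    "WHERE tags.value = 'sqlite' "
    "ORDER BY title").fetchall()
print(tagged)
```

Only the entry actually tagged "sqlite" survives the filter.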
As queries grow more complex, being able to encapsulate pieces of the query with Peewee objects becomes more useful and can facilitate code reuse.
Here is another example of json_each(). This time we will select the title of each entry and build a comma-separated string of its associated tags. We will re-use the tags_src and tags_ref definitions created above.
>>> query = (Entry
... .select(
... title.alias('title'),
... fn.group_concat(tags_ref.value, ', ').alias('tags'))
... .from_(Entry, tags_src)
... .group_by(title)
... .limit(5))
...
>>> [row for row in query.tuples()]
[(u'A Tour of Tagging Schemas: Many-to-many, Bitmaps and More',
u'peewee, sql, python'),
(u'Alternative Redis-Like Databases with Python',
u'python, walrus, redis, nosql'),
(u'Building the SQLite FTS5 Search Extension',
u'sqlite, search, python, peewee'),
(u'Connor Thomas Leifer', u'thoughts'),
(u'Extending SQLite with Python', u'peewee, python, sqlite')]
Again, for clarity, here is the corresponding SQL query:
SELECT
json_extract("t1"."data", '$.title') AS title,
group_concat("tags"."value", ', ') AS tags
FROM
"entry" AS t1,
json_each("t1"."data", '$.metadata.tags') AS tags
GROUP BY json_extract("t1"."data", '$.title')
LIMIT 5
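Here is that aggregation expressed directly in SQL against the standard-library sqlite3 module, again over a couple of made-up rows. Note that group_concat() makes no ordering guarantee for the concatenated values, so code that inspects the result should treat the tag list as a set:

```python
import json
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE entry (data TEXT)')
rows = [
    {'title': 'Extending SQLite with Python',
     'metadata': {'tags': ['python', 'sqlite']}},
    {'title': 'Connor Thomas Leifer',
     'metadata': {'tags': ['thoughts']}},
]
conn.executemany('INSERT INTO entry (data) VALUES (?)',
                 [(json.dumps(row),) for row in rows])

# One output row per entry: the title plus a comma-separated tag string
# built by aggregating the rows json_each() produced.
pairs = conn.execute(
    "SELECT json_extract(e.data, '$.title') AS title, "
    "       group_concat(tags.value, ', ') AS tag_list "
    "FROM entry AS e, json_each(e.data, '$.metadata.tags') AS tags "
    "GROUP BY title ORDER BY title").fetchall()
print(pairs)
```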
The last function I'll cover is json_tree(). Like json_each(), json_tree() is a multi-valued function that behaves like a table. Where json_each() only returns the direct descendants of the specified path, json_tree() will recursively traverse the object, returning all children.
If the tags key were nested arbitrarily deep within each entry, here is how we might find entries matching a given tag:
>>> tree = fn.json_tree(Entry.data, '$').alias('tree')
>>> parent = fn.json_tree(Entry.data, '$').alias('parent')
>>> tree_ref = Entity('tree')
>>> parent_ref = Entity('parent')
>>> query = (Entry
... .select(title.alias('title'))
... .from_(Entry, tree, parent)
... .where(
... (tree_ref.parent == parent_ref.id) &
... (parent_ref.key == 'tags') &
... (tree_ref.value == 'sqlite'))
... .order_by(title))
...
>>> [title for title, in query.tuples()]
[u'Building the SQLite FTS5 Search Extension',
u'Extending SQLite with Python',
u'Meet Scout, a Search Server Powered by SQLite',
u'My List of Python and SQLite Resources',
u'Querying Tree Structures in SQLite using Python and the Transitive Closure Extension',
u"Using SQLite4's LSM Storage Engine as a Stand-alone NoSQL Database with Python",
u'Web-based SQLite Database Browser, powered by Flask and Peewee']
What's going on in the above code is that we are selecting each Entry along with two json_tree traversals of its child nodes. Since each tree node contains a reference to its parent, we can simply search for a parent node named "tags" that contains a child node with the value "sqlite".
Here it is in SQL:
SELECT json_extract("t1"."data", '$.title') AS title
FROM
"entry" AS t1,
json_tree("t1"."data", '$') AS tree,
json_tree("t1"."data", '$') AS parent
WHERE (
("tree"."parent" = "parent"."id") AND
("parent"."key" = 'tags') AND
("tree"."value" = 'sqlite'))
ORDER BY json_extract("t1"."data", '$.title')
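To see the self-join in action outside the ORM, here is a sketch using the standard-library sqlite3 module with a contrived document whose tags are buried a couple of levels deep (the nesting keys are invented for the example):

```python
import json
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE entry (data TEXT)')

# Bury the tags list a couple of levels down to show the recursive traversal.
doc = {'title': 'Extending SQLite with Python',
       'extra': {'nested': {'tags': ['python', 'sqlite']}}}
conn.execute('INSERT INTO entry (data) VALUES (?)', (json.dumps(doc),))

# Join json_tree() against itself: "parent" finds the array node whose key is
# 'tags', and "tree" matches child nodes of that array with value 'sqlite'.
found = conn.execute(
    "SELECT json_extract(e.data, '$.title') "
    "FROM entry AS e, json_tree(e.data, '$') AS tree, "
    "     json_tree(e.data, '$') AS parent "
    "WHERE tree.parent = parent.id "
    "  AND parent.key = 'tags' "
    "  AND tree.value = 'sqlite'").fetchall()
print(found)
```

Because json_tree() visits every node, the query finds the tags array no matter how deeply it is nested.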
This is just a sampling of the functionality in the json1 extension, and I'm looking forward to experimenting with it more over the coming weeks. Please feel free to leave a comment here, or if you have specific questions about the extension, you might wish to refer to the sqlite-users mailing list.
FTS5 with Python
The code in this section is a continuation of the previous JSON example code, as we'll be using the titles from the Entry data file to populate a search index. The FTS5Model feature of peewee will be included in the 2.6.5 release, and is currently available in the master branch on GitHub.
Picking up from where we left off in the JSON examples, let's create another table that will be our search index for Entry data.
The fts5 extension requires that all columns be completely devoid of types or constraints. The only additional information we can provide is to mark a column as unindexed, meaning that the data is stored but not searchable.
Let's define a search index for the entry model that allows us to search over the titles and determine the associated URL. To do this we'll leave the url field unindexed.
class EntryIndex(FTS5Model):
title = SearchField()
url = SearchField(unindexed=True)
class Meta:
database = db
options = {
'prefix': [2, 3],
'tokenize': 'porter unicode61',
}
EntryIndex.create_table()
The options dictionary provides additional metadata to the fts5 extension, telling it how to tokenize words and what lengths of prefixes to store for fast prefix search. The SQL for the CREATE TABLE looks like this:
CREATE VIRTUAL TABLE "entryindex" USING fts5 (
"title" ,
"url" UNINDEXED,
prefix='2,3',
tokenize="porter unicode61")
To populate our index, we'll use a couple of JSON functions to copy data over from the Entry model:
title = fn.json_extract(Entry.data, '$.title').alias('title')
url = fn.json_extract(Entry.data, '$.url').alias('url')
query = Entry.select(title, url).dicts()
with db.atomic():
for entry in query:
EntryIndex.create(**entry)
Now that our index is populated, let's perform a few searches:
>>> query = EntryIndex.search('sqlite').limit(3)
>>> for result in query:
... print result.title
Extending SQLite with Python
Building the SQLite FTS5 Search Extension
My List of Python and SQLite Resources
The SQL for the above query looks like this:
SELECT "t1"."title", "t1"."url"
FROM "entryindex" AS t1
WHERE ("entryindex" MATCH 'sqlite')
ORDER BY rank
We can retrieve the score of our results as well:
>>> query = EntryIndex.search('sqlite AND python', with_score=True)
>>> for result in query:
... print round(result.score, 3), result.title
-1.259 Extending SQLite with Python
-1.059 My List of Python and SQLite Resources
-0.838 Querying Tree Structures in SQLite using Python and the Transitive Closure Extension
Notice how sensible those results are? The SQL for the above query looks like:
SELECT "t1"."title", "t1"."url", rank AS score
FROM "entryindex" AS t1
WHERE ("entryindex" MATCH 'sqlite AND python')
ORDER BY rank
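The ranked query also works without peewee, provided your Python's SQLite build has fts5 compiled in. Here is a sketch with the standard-library sqlite3 module over a few hypothetical titles; note that fts5's rank column is a BM25 score where more relevant rows are more negative, which is why ORDER BY rank puts the best matches first:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE VIRTUAL TABLE entryindex USING fts5(title, url UNINDEXED)")

docs = [
    ('Extending SQLite with Python', 'http://example.com/1'),
    ('My List of Python and SQLite Resources', 'http://example.com/2'),
    ('Connor Thomas Leifer', 'http://example.com/3'),
]
conn.executemany('INSERT INTO entryindex (title, url) VALUES (?, ?)', docs)

# rank is only meaningful alongside a MATCH constraint.
results = conn.execute(
    "SELECT title, rank FROM entryindex "
    "WHERE entryindex MATCH 'sqlite AND python' "
    "ORDER BY rank").fetchall()
for title, score in results:
    print(round(score, 3), title)
```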
I've barely scratched the surface of the functionality provided by the fts5 extension, but if you check out the documentation you'll find a lot of great features. Here are some examples:
- Index multiple columns, then assign different weights to them when ranking.
- Search by prefix, quoted phrase, or words within a certain proximity to one another.
- Boolean operators can combine any of the above search types.
- unicode61 is the default tokenizer; the porter tokenizer performs stemming.
- A new C API for defining your own ranking functions and tokenizers.
- Vocab tables, which allow you to query term counts and generally inspect the index.
Thanks for reading
I hope you enjoyed reading this post. I think the addition of a JSON extension to SQLite is very exciting for both the project and its users. PostgreSQL and MySQL both have their own JSON implementations, so it's good to see SQLite is not lagging behind. Having JSON may also obviate the need, in some cases, to reach for dedicated embedded document stores like UnQLite.
It's also worth noting that the file is named json1.c
. This is an effort, on Dr. Hipp's part, to show that this is just the first step, and that there will be more in the future. So whatever issues there may be with the current implementation, I'm sure future releases will continue to improve both the performance and APIs. For one, I believe he is looking into a binary format that would be more performant.
I'm also excited to see SQLite continuing to improve and innovate on the quality of its full-text search extension. Providing a built-in ranking algorithm and an API for users to implement their own is a much-needed addition.
Links of interest:
- json1 documentation
- fts5 documentation
- sqlite-users mailing list
- IPython notebook containing code from this post
- Peewee documentation
- Peewee SQLite extension docs
Other blog posts you may like:
- My list of Python and SQLite resources
- Using SQLite full-text search with Python
- Meet Scout, a Search Server Powered by SQLite
- Building the Python SQLite driver for use with BerkeleyDB
- Encrypted SQLite databases with Python and SQLCipher
- Querying Tree Structures in SQLite using Python and the Transitive Closure Extension
- Extending SQLite with Python
And, just to put this out there ahead of time, here are some responses to the news of SQLite's JSON module. Don't be one of these people!
- HackerNews sheeple: Why would anyone use anything besides Postgres?
- Graybeards: I thought it was called SQL-lite, now they're bloating it up with JSON?
- Trolls: Why don't you just use MongoDB?
Comments (3)
Oisin Mulvihill | nov 18 2015, at 11:46am
Thanks for this excellent article.
I was wondering if I could cut out the source compile steps on OSX with homebrew. Compiling from source is better and gets the latest versions, but homebrew saves a lot of time on OSX just to play around. It turned out I could, and I was able to get the sqlite3, json & fts5 extensions working with a pip-installed pysqlite. I show how here http://www.sourceweaver.com/post/133474396355/homebrew-install-of-sqlite-json-extension
Charles | nov 18 2015, at 11:50am
Nice post Oisin, thanks for sharing!
Anonymous | sep 17 2015, at 05:00pm
fapple
I am going to love you for this more than I already love you for peewee.
Commenting has been closed.