Using the SQLite JSON Extension with Python (old version)

Update November 11, 2015

With the release of SQLite 3.9.0, the instructions for compiling SQLite and building pysqlite have been greatly simplified.

I've written a new version of this post with updated instructions for compiling SQLite with JSON1 and FTS5. The post includes instructions for compiling pysqlite and apsw, as well as example code showing how to use these new extensions in your Python projects.

A couple weeks ago, word started getting around trendy programming circles about a new file that had appeared in the SQLite fossil repo named json1.c. Then last Friday, Dr. Hipp posted to the sqlite-users mailing list requesting feedback for a draft of the json1 module APIs, so I thought the Python community might be able to help out. In this post, we'll build SQLite with the new JSON extension, then build pysqlite against the json-ready SQLite. I've done this now on arch and ubuntu, but I'm not sure about fapple or windoze.

We'll be grabbing the latest SQLite source code -- to do this you can use fossil, the source-code management system used by SQLite, or alternatively you can pull down a compressed image. SQLite also requires tcl and awk to create the source amalgamation, so before starting in, you'll need to install:

There were a couple steps involved, so I'll try and break it down into sub-steps. To get started, we need a nice clean home for the new libraries. I put mine in ~/bin/jqlite for fun, but you can change the path to whatever you like.

export JQLITE="$HOME/bin/jqlite"
mkdir -p $JQLITE
cd $JQLITE

Option A, use fossil:

fossil clone http://www.sqlite.org/cgi/src sqlite.fossil
fossil open sqlite.fossil

Option B, fetch source image:

curl 'https://www.sqlite.org/src/tarball/sqlite.tar.gz?ci=trunk' | tar xz
mv sqlite/* .

You should now be in a directory alongside the SQLite source tree. Now we'll compile a static library, the only trick being that we'll manually include the code from the json1 extension:

export CFLAGS="-fPIC -DSQLITE_ENABLE_FTS3=1 \
-DSQLITE_ENABLE_COLUMN_METADATA=1 \
-DSQLITE_ENABLE_UNLOCK_NOTIFY \
-DSQLITE_SECURE_DELETE \
-DSQLITE_ENABLE_LOAD_EXTENSION=1"
./configure --prefix=$JQLITE --enable-static --disable-shared
make sqlite3.c
cat ext/misc/json1.c >> sqlite3.c
make lib_install

There should now be a file named lib/libsqlite3.a in the SQLite3 source checkout. If not, cruise the console output and hopefully the error is easy to spot.

The final step (and for me, the trickiest) is to build pysqlite against the new static library. It turns out this isn't hard, but in order for the JSON extension to be usable, it needs to be initialized when a connection is open. Looking at the source, there are conditional compilation flags in the SQLite openDatabase function that automatically initialize popular extensions like FTS and RTREE. So I opted to do a little hand-hacking of pysqlite to automatically register the JSON extension when the module is initialized -- that is the gnarly sed command. Other than that, it's pretty straightforward:

git clone https://github.com/ghaering/pysqlite
cd pysqlite
cp ../sqlite3.c .
echo -e "library_dirs=$JQLITE/lib" >> setup.cfg
echo -e "include_dirs=$JQLITE" >> setup.cfg
echo '#include "sqlite3.h"' | cat - src/module.c > src/module.tmp
sed -i "s|Py_InitModule\(.*$\)|Py_InitModule\1extern int sqlite3_json_init(sqlite3*);sqlite3_auto_extension((void(*)(void))sqlite3_json_init);|g" src/module.tmp
mv src/module.tmp src/module.c
python setup.py build_static

To test the install, you can cd into the build/lib.linux-xfoobar/ directory, open a python interpreter, and run the following:

>>> from pysqlite2.dbapi2 import connect
>>> conn = connect(':memory:')
>>> conn.execute('select json(?)', (1337,)).fetchone()
(u'1337',)

Phew. Now it's your choice, you can run python setup.py install, or symlink the newly-built pysqlite2 (look in build/lib.linux.../) into your $PYTHONPATH.

Using the JSON extension

There are some really neat features in the json1 extension, particularly the json_tree and json_each functions/virtual tables (documented here). To demonstrate the new features, we'll use peewee, a small Python ORM, to write some JSON data then query it.

I originally thought to get test data from GitHub's API, but in order to show off the features with minimum verbosity, I've instead contrived a little JSON file which can be viewed here. It is structured like so:

[{
   "title": "My List of Python and SQLite Resources",
   "url": "http://charlesleifer.com/blog/my-list-of-python-and-sqlite-resources/",
   "metadata": {"tags": ["python", "sqlite"]}
 },
 {
   "title": "Using SQLite4's LSM Storage Engine as a Stand-alone NoSQL Database with Python"
   "url": "http://charlesleifer.com/blog/using-sqlite4-s-lsm-storage-engine-as-a-stand-alone-nosql-database-with-python/",
   "metadata": {"tags": ["nosql", "python", "sqlite", "cython"]}
  },
  ...
]

If you prefer to view the code in IPython notebook format, you can view the notebook here.

Populating the database

We'll start by fetching the JSON data file and decoding it:

>>> import json, urllib2
>>> fh = urllib2.urlopen('http://media.charlesleifer.com/downloads/misc/blogs.json')
>>> data = json.loads(fh.read())
>>> data[0]
{u'metadata': {u'tags': [u'python', u'sqlite']},
 u'title': u'My List of Python and SQLite Resources',
 u'url': u'http://charlesleifer.com/blog/my-list-of-python-and-sqlite-resources/'}

Now we need to tell peewee how to access our database, so we'll wrap the SQLite database to use our custom pysqlite. I'm importing pysqlite2, which is the module we just compiled, but I'm aliasing it to jqlite just so there's no confusion. After the database class has been defined, we'll create an in-memory database.

>>> from pysqlite2 import dbapi2 as jqlite
>>> from peewee import *
>>> from playhouse.sqlite_ext import *
>>> class JQLiteDatabase(SqliteExtDatabase):
...     def _connect(self, database, **kwargs):
...         conn = jqlite.connect(database, **kwargs)
...         conn.isolation_level = None
...         self._add_conn_hooks(conn)
...         return conn
...
>>> db = JQLiteDatabase(':memory:')

Populating the database with our JSON data is pretty straightforward. We'll create a generic table with a single TEXT field. At present, SQLite does not expose a separate column/data-type for JSON data, so we will use TextField:

>>> class Entry(Model):
...     data = TextField()
...     class Meta:
...         database = db
...
>>> Entry.create_table()
>>> with db.atomic():
...     for entry_json in data:
...         Entry.create(data=json.dumps(entry_json))
...

JSON functions

We'll start by looking at json_extract(). It takes a dotted / bracketed path describing the element to find (unlike postgres which always uses []). Each Entry in the database contains a single data column which contains a JSON object. This JSON object has a title, url and metadata key at the top-level. Let's see how to extract the titles of our entries:

>>> title = fn.json_extract(Entry.data, '$.title')
>>> query = (Entry
...          .select(title.alias('title'))
...          .order_by(title)
...          .limit(5))
...
>>> [row for row in query.dicts()]
[{'title': u'A Tour of Tagging Schemas: Many-to-many, Bitmaps and More'},
 {'title': u'Alternative Redis-Like Databases with Python'},
 {'title': u'Building the SQLite FTS5 Search Extension'},
 {'title': u'Connor Thomas Leifer'},
 {'title': u'Extending SQLite with Python'}]

The query we created corresponds to the following SQL:

SELECT json_extract("t1"."data", '$.title') AS title
FROM "entry" AS t1
ORDER BY json_extract("t1"."data", '$.title')
LIMIT 5

In the next example we will extract the titles of entries that contain the given tag. To search the tags list, we will use a new function json_each(). This function behaves like a table (and in fact refers to a virtual table), and returns the direct children of the specified JSON path. Here is how to find the titles of entries tagged with "Sqlite":

>>> from peewee import Entity
>>> tags_src = fn.json_each(Entry.data, '$.metadata.tags').alias('tags')
>>> tags_ref = Entity('tags')

>>> query = (Entry
...          .select(title.alias('title'))
...          .from_(Entry, tags_src)
...          .where(tags_ref.value == 'sqlite')
...          .order_by(title))
...
>>> [row for row, in query.tuples()]
[u'Building the SQLite FTS5 Search Extension',
 u'Extending SQLite with Python',
 u'Meet Scout, a Search Server Powered by SQLite',
 u'My List of Python and SQLite Resources',
 u'Querying Tree Structures in SQLite using Python and the Transitive Closure Extension',
 u"Using SQLite4's LSM Storage Engine as a Stand-alone NoSQL Database with Python",
 u'Web-based SQLite Database Browser, powered by Flask and Peewee']

The SQL for the above query may help elucidate what's going on:

SELECT json_extract("t1"."data", '$.title') AS title
FROM
    "entry" AS t1,
    json_each("t1"."data", '$.metadata.tags') AS tags
WHERE ("tags"."value" = 'sqlite')
ORDER BY json_extract("t1"."data", '$.title')

As queries grow more complex, being able to encapsulate pieces of the query with Peewee objects becomes more useful and can facilitate code reuse.

Here is another example of json_each(). This time we will select the title of each entry, and build a comma-separated string of their associated tags. We will re-use the tags_src and tags_ref definitions created above.

>>> query = (Entry
...          .select(
...              title.alias('title'),
...              fn.group_concat(tags_ref.value, ', ').alias('tags'))
...          .from_(Entry, tags_src)
...          .group_by(title)
...          .limit(5))
...
>>> [row for row in query.tuples()]
[(u'A Tour of Tagging Schemas: Many-to-many, Bitmaps and More',
  u'peewee, sql, python'),
 (u'Alternative Redis-Like Databases with Python',
  u'python, walrus, redis, nosql'),
 (u'Building the SQLite FTS5 Search Extension',
  u'sqlite, search, python, peewee'),
 (u'Connor Thomas Leifer', u'thoughts'),
 (u'Extending SQLite with Python', u'peewee, python, sqlite')]

Again, for clarity, here is the corresponding SQL query:

SELECT
    json_extract("t1"."data", '$.title') AS title,
    group_concat("tags"."value", ', ') AS tags
FROM
    "entry" AS t1,
    json_each("t1"."data", '$.metadata.tags') AS tags
GROUP BY json_extract("t1"."data", '$.title')
LIMIT 5

The last function I'll cover is json_tree(). Like json_each(), json_tree() is a multi-valued function that behaves like a table. Where json_each() only returns the direct descendants of the specified path, json_tree() will recursively traverse the object, returning all children.

If the tags key was nested somewhere arbitrarily deep for each entry, here is how we might find entries that matched a given tag:

>>> tree = fn.json_tree(Entry.data, '$').alias('tree')
>>> parent = fn.json_tree(Entry.data, '$').alias('parent')

>>> tree_ref = Entity('tree')
>>> parent_ref = Entity('parent')

>>> query = (Entry
...          .select(title.alias('title'))
...          .from_(Entry, tree, parent)
...          .where(
...              (tree_ref.parent == parent_ref.id) &
...              (parent_ref.key == 'tags') &
...              (tree_ref.value == 'sqlite'))
...          .order_by(title))
...
>>> [title for title, in query.tuples()]
[u'Building the SQLite FTS5 Search Extension',
 u'Extending SQLite with Python',
 u'Meet Scout, a Search Server Powered by SQLite',
 u'My List of Python and SQLite Resources',
 u'Querying Tree Structures in SQLite using Python and the Transitive Closure Extension',
 u"Using SQLite4's LSM Storage Engine as a Stand-alone NoSQL Database with Python",
 u'Web-based SQLite Database Browser, powered by Flask and Peewee']

What's going on in the above code is we are selecting the Entry itself, along with two trees representing the entry's child nodes. Since each tree node contains a reference to it's parent, we can simply search for a parent node named "tags" that contains a child node with the value "sqlite".

Here it is in SQL:

SELECT json_extract("t1"."data", '$.title') AS title
FROM
    "entry" AS t1,
    json_tree("t1"."data", '$') AS tree,
    json_tree("t1"."data", '$') AS parent
WHERE (
    ("tree"."parent" = "parent"."id") AND
    ("parent"."key" = 'tags') AND
    ("tree"."value" = 'sqlite'))
ORDER BY json_extract("t1"."data", '$.title')

This is just a sampling of the functionality in the JSON1 extension, and I'm looking forward to experimenting with it more over the coming weeks. Please feel free to leave a comment here, or if you have specific questions about the extension, you might wish to refer to the sqlite-users mailing list.

Thanks for reading

I hope you enjoyed reading this post. I think the addition of a JSON extension to SQLite is very exciting for both the project and for its users. Postgresql and MySQL both have their own JSON implementations, so it's good to see SQLite is not lagging behind. Having JSON also may obviate the need, in some cases, to reach for dedicated embedded document stores like UnQLite.

It's also worth noting that the file is named json1.c. This is an effort, on Dr. Hipp's part, to show that this is just the first step, and that there will be more in the future. So whatever issues there may be with the current implementation, I'm sure future releases will continue to improve both the performance and APIs. For one, I believe he is looking into a binary format that would be more performant.

Links of interest:

And, just to put this out there ahead of time, here are some responses to the news of SQLite's JSON module. Don't be one of these people!

Comments (0)


Commenting has been closed, but please feel free to contact me