September 05, 2012 11:41 / 5 comments / django sql

I think it would be great if more sites allowed users (or consumers of their APIs) to produce and execute ad-hoc queries against their data. In this post I'll talk a little bit about some ways sites are currently doing this, some of the challenges involved, my experience trying to build something "reusable", and finally invite you to share your thoughts.

A few popular ways of doing ad-hoc queries

One easy way to express an ad-hoc query is with an "advanced search" form, which provides full-text search and numerous options to narrow the results -- as a web-monkey who has coded up quite a few of these, I can say that in the simple case they are quite easy to write and usually consist of a series of if statements that build up the WHERE clause of a query. When the result set is very large it can also be useful to pair advanced search with drill-down. Building good drill-down search can be hard work as you need to compute "summary" information for various facets of your result set. Advanced search combined with drill-down is something I see on all the big e-commerce sites. It seems to work better than straight-up text search or forcing users to navigate a hierarchy of various classifications. Unfortunately it is somewhat rigid and usually requires you to specify up-front what facets to expose. Full-text can also be expensive and may not always make sense.

Another popular way of sharing information is by exposing collections of serialized data at well-known URLs. List view/detail view, we all know how it works. Taking a look at github's API, they've chosen to nest collections so a particular user's repositories are accessible via the users endpoint, e.g.

https://api.github.com/users/coleifer/repos

In a sense, this is a form of an ad-hoc query, albeit a very limited one since we can only filter collections of repos by their owner. Github's API documentation lists a number of "parameters" you can use to specify additional filtering/sorting options.

Another popular site with coders, readthedocs took a different approach, exposing flat collections and allowing the user to filter the collection by serializing filters into the request's querystring:

http://readthedocs.org/api/v1/project/?format=json&slug__contains=peewee

This is an interesting technique and is a good way of producing simple queries. For users of the django ORM, it should feel pretty familiar since this is the same way you specify various lookups when querying the database. It allows for simple "AND" lookups only.

The last technique I'll just mention is the approach taken by Yahoo! with YQL, a SQL-like DSL for querying diverse data sources like flickr and yelp. There are quite a few sites that implement yahoo's open data table API. YQL is more expressive and allows developers to create finer-grained queries than either of the previous two techniques. The whole thing strikes me as kinda gross, though.

Wouldn't it be cool if...

There were some reusable way of exposing your data to your users in such a way that they could create and execute their own queries. Currently, I imagine this happening either somewhere on the frontend and the interface being composed of form elements, or via a JSON (etc) API using the querystring. Maybe you have some other ideas - if so please leave a comment below!

Each of these approaches has some limitations, though. Typically a query is represented as a tree. In SQL we create depth using parentheses and the expressions at the "leaves" are joined by either AND or OR connectors:

-- get active staff or admin users --
SELECT * FROM users
WHERE ((is_admin = true OR is_staff = true) AND is_active = true);

This might look something like this:

            AND
           /   \
          OR   (is_active = true)
        /    \
       /   (is_staff = true)
      /
(is_admin = true)

The iTunes smart playlist is a good example of how this type of interface might work:

Smart playlist

How do you go about serializing this information into HTML form elements? To complicate things further, what if you want to expose fields that are on a model related by foreign key? What if we are looking at a list of comments and want to get only those made by active staff or admin users?

-- get comments made by active staff or admin users --
SELECT * FROM comments, users
WHERE (
    (users.is_admin = true OR users.is_staff = true) AND
    users.is_active = true AND
    users.id = comments.user_id
);

I don't want to forget the other method, a JSON API, for power users who may just be interested in serialized raw data (rather than accessing javascript-heavy forms via the frontend). There is no reason the actual query serialization need be any different between the form version and the API version, except if the form version were too cumbersome or was too awkward to use in the simple case. Another win would be to lean on named form and querystring parameters, and not have to implement our own query parser.

Solving this problem, particularly in a configurable and reusable way, could really empower users. I imagine it being useful for reporting or analytics software, but also as a replacement for typical "advanced search". For instance querying GitHub for projects written in language X, starred by over Y people, where the description contains one of several keywords.

Why do I care?

I've been thinking a lot about how to allow dynamic ad-hoc queries for peewee models (peewee is a python ORM). The idea is to introspect the models and provide a configurable (and reusable) way for allowing ad-hoc queries via either a form-like interface or via the querystring. So far the implementation is simple. While it does allow filtering by fields across foreign keys, it does not support nesting. Instead it assumes that if you are querying a column more than once you want to OR those columns together. It looks like this:

Model filters

It's not perfect, but it's also not a whole lot of code. The best part is I'm able to reuse a good chunk between the "admin" interface you see above and the JSON API.

I'm very interested to hear others' opinions on how to tackle this problem, including how you might serialize query data, so feel free to leave a comment. Hope you enjoyed reading!

Related links

More like this

Comments (5)

Tim Arnold | sep 06 2012, at 08:27am

I'm just learning about this area, but I thought the tastypie app provided the ability for ad-hoc queries. On further thought (told you I'm just a beginner), I guess not, since you have to have setup resources to match so it is more like the readthedocs approach.

What do you think about htsql? That seems to allow you to do whatever you want as a user. [htsql.org/doc]

Salman | sep 06 2012, at 02:30pm

Standards-based data interchange is an interesting area with several technology candidates. In the enterprise space, XML, SAML, and their ilk have a foothold. In the web space it's not clear what the options are. Personally I like ElasticSearch's DSL. The Data Protocols community is trying to tackle this problem head-on: http://www.dataprotocols.org/en/latest/index.html

Hope this is useful.

Charles Leifer | sep 06 2012, at 02:54pm

Tim -- thanks for your message. You're correct on both counts, actually, as readthedocs uses tastypie to provide its API. It's a very popular project among django devs.

Salman, thanks for the link, the "query protocol" is exactly what I'm talking about, and they explicitly mention how nice ES's DSL is. Cool stuff, I am actually adding a link to this page in the 'related links' heading.

Alistair Lattimore | sep 07 2012, at 07:41pm

Charles,

How do you see popular APIs like the Google Analytics Core Reporting API, which allow for ad hoc queries?

Al.

Charles Leifer | sep 08 2012, at 01:14pm

Al -- thanks for you comment. I'd recommend checking out the google analytics query explorer -- gives a nice overview of how ad-hoc queries are constructed for their reporting API: http://ga-dev-tools.appspot.com/explorer/


Commenting has been closed, but please feel free to contact me