I think it would be great if more sites allowed users (or consumers of their APIs) to produce and execute ad-hoc queries against their data. In this post I'll talk a little bit about some ways sites are currently doing this, some of the challenges involved, my experience trying to build something "reusable", and finally invite you to share your thoughts.
One easy way to express an ad-hoc query is with an "advanced search"
form, which provides full-text search and numerous options to narrow the results --
as a web-monkey who has coded up quite a few of these, I can say that in the simple
case they are quite easy to write and usually consist of a series of
that build up the
WHERE clause of a query. When the result set is very large
it can also be useful to pair advanced search with drill-down. Building good
drill-down search can be hard work as you need to compute "summary" information for various
facets of your result set. Advanced
search combined with drill-down is something I see on all the big e-commerce
sites. It seems to work better than straight-up text search or forcing users to
navigate a hierarchy of various classifications. Unfortunately it is somewhat rigid
and usually requires you to specify up-front what facets to expose. Full-text can
also be expensive and may not always make sense.
Another popular way of sharing information is by exposing collections of serialized
data at well-known URLs. List view/detail view, we all know how it works. Taking a
look at github's API, they've chosen to nest collections so a particular user's
repositories are accessible via the
users endpoint, e.g.
In a sense, this is a form of an ad-hoc query, albeit a very limited one since we can only filter collections of repos by their owner. Github's API documentation lists a number of "parameters" you can use to specify additional filtering/sorting options.
Another popular site with coders, readthedocs took a different
approach, exposing flat collections and allowing the user to filter the collection
by serializing filters into the request's
This is an interesting technique and is a good way of producing simple queries. For users of the django ORM, it should feel pretty familiar since this is the same way you specify various lookups when querying the database. It allows for simple "AND" lookups only.
The last technique I'll just mention is the approach taken by Yahoo! with YQL, a SQL-like DSL for querying diverse data sources like flickr and yelp. There are quite a few sites that implement yahoo's open data table API. YQL is more expressive and allows developers to create finer-grained queries than either of the previous two techniques. The whole thing strikes me as kinda gross, though.
There were some reusable way of exposing your data to your users in such a way that they could create and execute their own queries. Currently, I imagine this happening either somewhere on the frontend and the interface being composed of form elements, or via a JSON (etc) API using the querystring. Maybe you have some other ideas - if so please leave a comment below!
Each of these approaches has some limitations, though. Typically a query is represented
as a tree. In SQL we create depth using parentheses and the expressions at the "leaves"
are joined by either
-- get active staff or admin users -- SELECT * FROM users WHERE ((is_admin = true OR is_staff = true) AND is_active = true);
This might look something like this:
AND / \ OR (is_active = true) / \ / (is_staff = true) / (is_admin = true)
The iTunes smart playlist is a good example of how this type of interface might work:
How do you go about serializing this information into HTML form elements? To complicate things further, what if you want to expose fields that are on a model related by foreign key? What if we are looking at a list of comments and want to get only those made by active staff or admin users?
-- get comments made by active staff or admin users -- SELECT * FROM comments, users WHERE ( (users.is_admin = true OR users.is_staff = true) AND users.is_active = true AND users.id = comments.user_id );
Solving this problem, particularly in a configurable and reusable way, could really empower users. I imagine it being useful for reporting or analytics software, but also as a replacement for typical "advanced search". For instance querying GitHub for projects written in language X, starred by over Y people, where the description contains one of several keywords.
I've been thinking a lot about how to allow dynamic ad-hoc queries for peewee
models (peewee is a python ORM). The idea is
to introspect the models and provide a configurable (and reusable) way for allowing
ad-hoc queries via either a form-like interface or via the querystring. So far the
implementation is simple. While it does allow filtering by fields across foreign
keys, it does not support nesting. Instead it assumes that if you are querying a
column more than once you want to
OR those columns together. It looks like this:
It's not perfect, but it's also not a whole lot of code. The best part is I'm able to reuse a good chunk between the "admin" interface you see above and the JSON API.
I'm very interested to hear others' opinions on how to tackle this problem, including how you might serialize query data, so feel free to leave a comment. Hope you enjoyed reading!
I'm just learning about this area, but I thought the tastypie app provided the ability for ad-hoc queries. On further thought (told you I'm just a beginner), I guess not, since you have to have setup resources to match so it is more like the readthedocs approach.
What do you think about htsql? That seems to allow you to do whatever you want as a user. [htsql.org/doc]
Standards-based data interchange is an interesting area with several technology candidates. In the enterprise space, XML, SAML, and their ilk have a foothold. In the web space it's not clear what the options are. Personally I like ElasticSearch's DSL. The Data Protocols community is trying to tackle this problem head-on: http://www.dataprotocols.org/en/latest/index.html
Hope this is useful.
Tim -- thanks for your message. You're correct on both counts, actually, as readthedocs uses tastypie to provide its API. It's a very popular project among django devs.
Salman, thanks for the link, the "query protocol" is exactly what I'm talking about, and they explicitly mention how nice ES's DSL is. Cool stuff, I am actually adding a link to this page in the 'related links' heading.
How do you see popular APIs like the Google Analytics Core Reporting API, which allow for ad hoc queries?
Al -- thanks for you comment. I'd recommend checking out the google analytics query explorer -- gives a nice overview of how ad-hoc queries are constructed for their reporting API: http://ga-dev-tools.appspot.com/explorer/
Commenting has been closed, but please feel free to contact me