Missing Link

Edit 11/13/2013: I added some Django integration to peewee to make it (hopefully) easier to build structured queries when you need to work around the ORM's limitations. Read the docs. Any feedback would definitely be appreciated!


I had the opportunity this week to write some fairly interesting SQL queries. I don't write "raw" SQL too often, so it was fun to use that part of my brain (by the way, does it bother anyone else when people call SQL "raw"?). At Counsyl we use Django for pretty much everything so naturally we also use the ORM. Every place I've worked there's a strong bias against using SQL when you've got an ORM on board, which makes sense -- if you choose a tool you should standardize on it if for no other reason than it makes maintenance easier.

So as I was saying, I had some pretty interesting queries to write and I struggled to think how to shoehorn them into Django's ORM. I've already written about some of the shortcomings of Django's ORM so I won't rehash those points. I'll just say that Django fell short and I found myself writing SQL. The queries I was working on joined models from very disparate parts of our codebase. The joins were on values that weren't necessarily foreign keys (think UUIDs) and this is something that Django just doesn't cope with. Additionally I was interested in aggregates on calculated values, and it seems like Django can only do aggregates on a single column.

As I was prototyping, I found several mistakes in my queries and decided to run them in the postgres shell before translating them into my code. I started to think that some of these errors could have been avoided if I could find an abstraction that sat between the ORM and a string of SQL. By leveraging the python interpreter, the obvious syntax errors could have been caught at module import time. By using composable data structures, methods I wrote that used similar table structures could have been more DRY. When I write less code, I think I generally write less bugs as well.

That got me started on my search for the "missing link" between SQL (represented as a string) and Django's ORM.

Django query internals

Have you gone digging into Django's db.models.sql modules? I hadn't spent much time there, so I took a journey. There's a module named aggregates that has a handful of classes for representing various aggregation functions over a single column. There's a module named where for representing the query tree that comprises the where clause, and also has the SQL generation code for expressions like status = true and IN, IS NULL, etc. There's a module named query and also one called subqueries. There are methods with obscure messages in the docstrings, like Query.add_filter:

If 'negate' is True, this is an exclude() filter. It's important to note that this method does not negate anything in the where-clause object when inserting the filter constraints. This is because negated filters often require multiple calls to add_filter() and the negation should only happen once. So the caller is responsible for this (the caller will normally be add_q(), so that as an example).

The whole thing is so baroque that I started to wonder if I'd ever find the missing link. It seems that Django was built using what I see as a "top-down" approach, where there are several very large classes which encapsulate and hide the workings of smaller subclasses which are only used internally. As soon as you get one or two levels in, you meet docstrings like the one above and at that point you're sunk.

Even the module structure implies this top-down design. django.db.models exposes familiar friends like fields and models. Another level down you have the sql modules. In the __init__ module of django.db.models.sql, there are only four classes that are exposed to the outside world:

__all__ = ['Query', 'AND', 'OR', 'EmptyResultSet']

But as I said, if you go looking around you'll find all sorts of other things. The top-down approach encourages consumers of the API to treat these classes and modules as black boxes, but if you go looking for the missing link... well, good luck.

Top-down makes hacking hard

I came away from my code dive with the conclusion that top-down designs are going to be hard to hack on, particularly when the part you're interested in is far below the facade. Contrast the design of Django's ORM with the architecture of the excellent library SQLAlchemy (link to discussion of sqa's architecture). Above the database and database-specific dialect is the SQL expression engine. That sounds like a missing link to me, and indeed the SQL expression engine is a pythonic API for generating the expressions that, when composed, create queries.

Peewee is similarly built using a bottom-up approach. Almost all the classes used to construct queries are derived from a class named Node, which can be combined in predictable ways with the various other subclasses such as fields, expressions, aggregations, etc.

Was the Django ORM's design intentional, or just an accident?

Archaeology

I pulled up the source code for Django 1.0, and was surprised to see that much of the sql module looks the same as it does now, 5 years later. I went further back to Django 0.96, 7 years in the past (a really long time!) and started to see how the design of the ORM's API came about. The high-level methods we know like .filter(), .values() were all there, but the SQL generation was pretty disorganized. There didn't seem to be any real analog to 1.0's sql modules.

Looking at it this way, maybe when the ORM was rewritten, the top-down approach made sense because the foremost concern of the Django devs was to maintain backwards compatibility with the previous APIs. Django has a great reputation for not breaking things between releases (something, as a developer, I am grateful for). Maybe the rewritten ORM was such a massive improvement over the old one that nobody was remotely critical of its design.

Now what?

I think it would be useful for Django to have a structured representation of SQL that is accessible to developers. The ORM is great, and writing SQL by hand is flexible, but having a pythonic layer between the two would definitely be a good thing.

I wonder if there would be value in building a bridge between SQLAlchemy or peewee's expression building, and Django's "raw" query facilities -- what do you think?

A long quote from Richard Feynman

This quote was taken from Richard Feynman's Personal observations on the reliability of the Shuttle, written after the Challenger disaster:

The usual way that such engines are designed (for military or civilian aircraft) may be called the component system, or bottom-up design. First it is necessary to thoroughly understand the properties and limitations of the materials to be used (for turbine blades, for example), and tests are begun in experimental rigs to determine those. With this knowledge larger component parts (such as bearings) are designed and tested individually. As deficiencies and design errors are noted they are corrected and verified with further testing. Since one tests only parts at a time these tests and modifications are not overly expensive. Finally one works up to the final design of the entire engine, to the necessary specifications. There is a good chance, by this time that the engine will generally succeed, or that any failures are easily isolated and analyzed because the failure modes, limitations of materials, etc., are so well understood. There is a very good chance that the modifications to the engine to get around the final difficulties are not very hard to make, for most of the serious problems have already been discovered and dealt with in the earlier, less expensive, stages of the process.

The Space Shuttle Main Engine was handled in a different manner, top down, we might say. The engine was designed and put together all at once with relatively little detailed preliminary study of the material and components. Then when troubles are found in the bearings, turbine blades, coolant pipes, etc., it is more expensive and difficult to discover the causes and make changes. For example, cracks have been found in the turbine blades of the high pressure oxygen turbopump. Are they caused by flaws in the material, the effect of the oxygen atmosphere on the properties of the material, the thermal stresses of startup or shutdown, the vibration and stresses of steady running, or mainly at some resonance at certain speeds, etc.? How long can we run from crack initiation to crack failure, and how does this depend on power level? Using the completed engine as a test bed to resolve such questions is extremely expensive. One does not wish to lose an entire engine in order to find out where and how failure occurs. Yet, an accurate knowledge of this information is essential to acquire a confidence in the engine reliability in use. Without detailed understanding, confidence can not be attained.

A further disadvantage of the top-down method is that, if an understanding of a fault is obtained, a simple fix, such as a new shape for the turbine housing, may be impossible to implement without a redesign of the entire engine.

Reading more

Thanks for taking the time to read this post, feel free to leave a comment below!

Comments (7)

  • Eric Carmichael | November 2013, at 17:15

    "I wonder if there would be value in building a bridge between SQLAlchemy or peewee's expression building, and Django's "raw" query facilities -- what do you think?" -- I think that's such an awesome idea! I never thought of that, could get a lot of cool functionality that way.

    Will you have a follow up article about inserting the missing link, how well it works, and if it works as well as you expected?!


  • Charlie | November 2013, at 17:25

    Thanks for reading Eric! Yes, I'll probably hack something up using peewee and see how it works. I'll post an update in the next day or two.


  • Luke Plant | November 2013, at 19:21

    The ORM was written very much with the "top down" approach, and this was pretty much accidental.

    Early versions of the ORM had some pretty simplistic logic regarding joins, for example, which worked surprisingly well, but incorrectly some of the time.

    Over time these were re-written, maintaining compatibility, while at the same time new high level features were added. So the lower level SQL layer just expanded as necessary to support the higher levels. The lack of coherent design is an accident, but it is not easily fixed.

    In the past I have suggested re-writing Django's high level ORM on top of SQLAlchemy, but there are some fundamental differences (things like unit of work etc) that stand in the way, plus it wouldn't reduce the amount of code in Django by as much as you might hope.

    If you find yourself making mistakes with raw SQL, as I do, you can always use SQLAlchemy in a Django project. The biggest problem is that you can end up with two types of connection object (with their own transactions etc), but this can be fixed using aldjemy - https://github.com/Deepwalker/aldjemy


  • Charlie | November 2013, at 19:28

    Thank you so much for your comment Luke. I'm glad you could shed some light on the development of the ORM and some of the challenges involved in changing it.


  • Justin | November 2013, at 00:16

    This video from Malcom Tredinnick may answer some of your questions about the history of the ORM. He did most of the rewrite http://www.youtube.com/watch?v=bgV39DlmZ2U


  • Dan | November 2013, at 08:34

    To answer your question at the top of the article: I use raw SQL, but only after I've exhausted the features of the ORM. Using the ORM when possible isolates my code from changes to the Django platform and in the underlying database, and makes my code more portable. However, the ORM is limited, so anything out of the ordinary may require going raw.


  • Charlie | November 2013, at 10:11

    Justin - thanks for sharing that video, I'll check it out.

    Dan - good point about portability, that is one of the better features of the ORM and I did not mention it in my post.


Commenting has been closed, but please feel free to contact me