Shortcomings in the Django ORM and a look at Peewee, a lightweight alternative
In this post I'd like to talk about some of the shortcomings of the Django ORM, the ways peewee approaches things differently, and how this resulted in peewee having an API that is both more consistent and more expressive.
Alex Gaynor, one of the more outspoken core developers on the Django project, gave a great talk at ChiPY titled "Why I Hate the Django ORM" (slides). I think he did a great job identifying what I agree are the two biggest issues with Django's ORM:
- inconsistent API
- lack of composability
The Django ORM has an inconsistent API
The framework should be consistent at all levels. Consistency applies to everything from low-level (the Python coding style used) to high-level (the “experience” of using Django).
On the whole I think Django does a good job with this - the big exception being the ORM. Alex gives 4 examples, which should be familiar to Django developers and which are indicative of the underlying issues:
filter() is the most basic method and is the most common. It is used to express a SQL Where clause and takes as parameters a list of keyword arguments mapping field names to values, and it can take a list of Q objects. Q objects also take a keyword argument and are used to allow "combining" one or more expressions into a query tree (this is how you express a logical "OR").
This is the first inconsistency -- there are two methods of expressing the same query, one is just more "specialized":
>>> sq1 = str(Blog.objects.filter(author='charlie').query) >>> sq2 = str(Blog.objects.filter(Q(author='charlie')).query) >>> sq1 == sq2 True
What happens if you want to reference the value of another column in your call to filter? Maybe there is an Employee model that stores their actual salary and their desired salary...for this you use the F class:
Here we see two ways of identifying fields -- as keyword arguments passed to filter and Q, and as strings when passed to F.
Another common operation is to take the aggregate of the values in a particular common, like the SUM or COUNT, or GROUP BY a particular column. There are a ton of questions on stackoverflow asking over how to do this. The answer involves using three new APIs:
values()- takes a list of string field names
annotate()- takes an "aggregate" function
Sum(), etc - take a string field name
The fact that all these specialized functions are needed to express a fairly common query, and the fact that they all require a specialized API, is a sign of a lurking design problem. For anything more than a simple Where clause, Django quickly bogs down in its own APIs.
Lack of composability
The database API should allow rich, expressive statements in as little syntax as possible. It should not rely on importing other modules or helper objects.
Django falls very short on the first part. Rich, expressive statements are simply not possible in the ORM unless a special API was designed for it (like Q and F objects). To create a rich API, it is necessary to allow composability -- small pieces can be composed to create larger, more complex pieces.
Alex gives a good example:
SELECT event_type, SUM(end_time - start_time) AS total_time FROM event GROUP BY event_type
Which he expresses using Django:
Event.objects.values('event_type').annotate( total_time=Sum(F('end_time') - F('start_time')) )
This actually does not work, Django chokes because the Sum function does not know how to handle an expression. Essentially anything beyond the very straightforward use-cases the existing APIs were designed for will probably not work or will have been addressed by some special-case logic.
Django's ORM falls short on the second part of the design goal as well, requiring users to import special functions for aggregation and use odd one-letter classes ("Q" and "F") to express certain types of queries.
Learning from Django
When I first wrote peewee I based a lot of my APIs on those I was familiar with. Coming from Django this meant "kwargs"-style querying and double-underscore lookups. As peewee grew and I added features, I mimicked django and added APIs for expressing logical "OR" (Q objects) and column-to-column comparison (F objects). Before long the code was a mess and people were submitting issues when they tried to express a query I hadn't planned for. I think that this is a parallel to how the Django ORM has grown up.
I decided to rewrite. It was the best decision I could have made and I learned a ton in the process. I don't have near the number of users, so this was an acceptable path for my project. Peewee is also only an ORM, so the argument for making it a better ORM outweighs some other concerns.
While I rewrote I decided to focus entirely on the mechanics of expressing rich queries. To do this I took a look at all the atoms that comprise a SQL query, how they interact, and how they are composed. I identified a few things:
- Scalars and Parameters (e.g.
foo = "Bar")
Clauses are things like SELECT and FROM which denote particular parts of the query -- these I decided to expose as methods on a class. Columns, or field instances, would be exposed as class attributes. Scalars and parameters were easy, just represent them using python's various types. Finally functions, since they are so diverse and accept varying numbers of parameters, would be exposed dynamically.
This allows us to write the above "events" query:
Event.select( Event.event_type, fn.Sum(Event.end_time - Event.start_time).alias('total_time') ).group_by(Event.event_type)
The clauses accept as their arguments one or more columns, scalars, functions, or any combination thereof. These components can be combined using logical OR and AND to create query trees. They also support common operations like addition and subtraction, allowing you to express atomic updates and things like the above example where we subtract one column from another.
Peewee is more consistent and expressive
Peewee shed about a third of its code-base during the rewrite, going from 2400 SLOC to around 1600! It also became much more expressive -- more than once I have written a fairly complex query and been pleasently surprised to see that it just works. Rather than losing functionality, I have gained flexibility which in turn produces functionality.
I think that Django would benefit from a similar rewrite. The ORM is one of the most complicated parts of Django and sees some pretty crazy bugs. I'm sure I'm not the only person who thinks this should happen...the problem is:
How can the ORM change without breaking backwards compatibility?
From the beginning Django has been committed to backwards compatibility. This may be one of the biggest contributors to Django's adoption.
I would suggest building out a new API that is similar to peewee's. Since the existing APIs are a subset of the functionality possible in peewee, the existing APIs could be rewritten to use the new APIs and marked for deprecation. When rewriting peewee, I included a backwards-compatible method to allow the "django-style" double-underscore querying that had been possible in the older version.
What do you think?
What is your take on the Django ORM? If you use something like SQLAlchemy or peewee, how do you feel that it compares to Django's ORM? Are there things Django can do that Peewee or SQLAlchemy cannot? One example is that, while peewee supports "prefetch_related", it is not quite as powerful as Django's implementation (Full disclosure, I think this feature is kind of gross).
- "Why I hate the Django ORM", talk by Alex Gaynor at ChiPY and slides
- Django's design philosophy
- Notes from the peewee rewrite
- Peewee project page
Commenting has been closed, but please feel free to contact me