In this post I'd like to talk about some of the shortcomings of the Django ORM, the ways peewee approaches things differently, and how this resulted in peewee having an API that is both more consistent and more expressive.
Alex Gaynor, one of the more outspoken core developers on the Django project, gave a great talk at ChiPY titled "Why I Hate the Django ORM" (slides). I think he did a great job identifying what I agree are the two biggest issues with Django's ORM:
The framework should be consistent at all levels. Consistency applies to everything from low-level (the Python coding style used) to high-level (the “experience” of using Django).
On the whole I think Django does a good job with this - the big exception being the ORM. Alex gives 4 examples, which should be familiar to Django developers and which are indicative of the underlying issues:
filter() is the most basic method and is the most common. It is used to express a SQL Where clause and takes as parameters a list of keyword arguments mapping field names to values, and it can take a list of Q objects. Q objects also take a keyword argument and are used to allow "combining" one or more expressions into a query tree (this is how you express a logical "OR").
This is the first inconsistency -- there are two methods of expressing the same query, one is just more "specialized":
>>> sq1 = str(Blog.objects.filter(author='charlie').query) >>> sq2 = str(Blog.objects.filter(Q(author='charlie')).query) >>> sq1 == sq2 True
What happens if you want to reference the value of another column in your call to filter? Maybe there is an Employee model that stores their actual salary and their desired salary...for this you use the F class:
Here we see two ways of identifying fields -- as keyword arguments passed to filter and Q, and as strings when passed to F.
Another common operation is to take the aggregate of the values in a particular common, like the SUM or COUNT, or GROUP BY a particular column. There are a ton of questions on stackoverflow asking over how to do this. The answer involves using three new APIs:
values()- takes a list of string field names
annotate()- takes an "aggregate" function
Sum(), etc - take a string field name
The fact that all these specialized functions are needed to express a fairly common query, and the fact that they all require a specialized API, is a sign of a lurking design problem. For anything more than a simple Where clause, Django quickly bogs down in its own APIs.
The database API should allow rich, expressive statements in as little syntax as possible. It should not rely on importing other modules or helper objects.
Django falls very short on the first part. Rich, expressive statements are simply not possible in the ORM unless a special API was designed for it (like Q and F objects). To create a rich API, it is necessary to allow composability -- small pieces can be composed to create larger, more complex pieces.
Alex gives a good example:
SELECT event_type, SUM(end_time - start_time) AS total_time FROM event GROUP BY event_type
Which he expresses using Django:
Event.objects.values('event_type').annotate( total_time=Sum(F('end_time') - F('start_time')) )
This actually does not work, Django chokes because the Sum function does not know how to handle an expression. Essentially anything beyond the very straightforward use-cases the existing APIs were designed for will probably not work or will have been addressed by some special-case logic.
Django's ORM falls short on the second part of the design goal as well, requiring users to import special functions for aggregation and use odd one-letter classes ("Q" and "F") to express certain types of queries.
When I first wrote peewee I based a lot of my APIs on those I was familiar with. Coming from Django this meant "kwargs"-style querying and double-underscore lookups. As peewee grew and I added features, I mimicked django and added APIs for expressing logical "OR" (Q objects) and column-to-column comparison (F objects). Before long the code was a mess and people were submitting issues when they tried to express a query I hadn't planned for. I think that this is a parallel to how the Django ORM has grown up.
I decided to rewrite. It was the best decision I could have made and I learned a ton in the process. I don't have near the number of users, so this was an acceptable path for my project. Peewee is also only an ORM, so the argument for making it a better ORM outweighs some other concerns.
While I rewrote I decided to focus entirely on the mechanics of expressing rich queries. To do this I took a look at all the atoms that comprise a SQL query, how they interact, and how they are composed. I identified a few things:
foo = "Bar")
Clauses are things like SELECT and FROM which denote particular parts of the query -- these I decided to expose as methods on a class. Columns, or field instances, would be exposed as class attributes. Scalars and parameters were easy, just represent them using python's various types. Finally functions, since they are so diverse and accept varying numbers of parameters, would be exposed dynamically.
This allows us to write the above "events" query:
Event.select( Event.event_type, fn.Sum(Event.end_time - Event.start_time).alias('total_time') ).group_by(Event.event_type)
The clauses accept as their arguments one or more columns, scalars, functions, or any combination thereof. These components can be combined using logical OR and AND to create query trees. They also support common operations like addition and subtraction, allowing you to express atomic updates and things like the above example where we subtract one column from another.
Peewee shed about a third of its code-base during the rewrite, going from 2400 SLOC to around 1600! It also became much more expressive -- more than once I have written a fairly complex query and been pleasently surprised to see that it just works. Rather than losing functionality, I have gained flexibility which in turn produces functionality.
I think that Django would benefit from a similar rewrite. The ORM is one of the most complicated parts of Django and sees some pretty crazy bugs. I'm sure I'm not the only person who thinks this should happen...the problem is:
How can the ORM change without breaking backwards compatibility?
From the beginning Django has been committed to backwards compatibility. This may be one of the biggest contributors to Django's adoption.
I would suggest building out a new API that is similar to peewee's. Since the existing APIs are a subset of the functionality possible in peewee, the existing APIs could be rewritten to use the new APIs and marked for deprecation. When rewriting peewee, I included a backwards-compatible method to allow the "django-style" double-underscore querying that had been possible in the older version.
What is your take on the Django ORM? If you use something like SQLAlchemy or peewee, how do you feel that it compares to Django's ORM? Are there things Django can do that Peewee or SQLAlchemy cannot? One example is peewee does not support "prefetch_related" (Full disclosure, I think this feature is kind of gross).
I think that the design of the Django ORM is a simultaneous strength and weakness. You've already pointed out the weakness, which is that it chokes on queries that are asking the database to do some of the procedural work. I'd point out that this might also be considered a strength of the ORM though.
The things that the ORM is bad at are all tasks that can be done by simply iterating over a queryset and doing your procedural work (like computing sums) in Python instead of in SQL. There's alot of good sense behind this. Python is already an elegant language for this type of procedural logic. The ORM is really designed to get you out of SQL and into Python as quickly as possible.
Is there a downside to this? Possibly a performance hit. Sometimes. Depending on the query. I expect it will very rarely be enough of a performance hit that you care about it, and when it is there is always the option of dropping down into raw SQL if you're really sure that your database is much better optimized for the task than Python is.
The peewee API is an improvement over the Django API. I have worked with a few ORM's, and even wrote one of my own... but over time I have come to appreciate SQL more. My preferred approach now is to stay close to that level, and just write simple helpers and convenience functions for building SQL queries rather than trying to devise an entire DSL on top of SQL, which is already a DSL. :)
As pointed out, SQL itself is a DSL. A wrapper over it might give you syntactical relief (from SQL), less verbosity and a coherent way of representation at application level, but might not increase the usability. The inconsistency pointed out actually makes the ORM very easy to use for developing applications and creating simple queries. 90-95% of most SQL queries in an application are covered by filter, Q, F etc. If I were to replace it with an ORM which is a consistent, expressive but lacks ease, then it might affect productivity. May be I am old school, but if I want to write a complex query, I would want to use SQL instead of any ORM.
Somehow I both agree and disagree with the point you made. Yes, Django's ORM is far from being perfect, it evolved over time and developed inconsistencies and various quirks on the way.
But as long as the code is readable and solves the problem at hand, I'm okay with it. Python itself is inconsistent in many ways, and Python's stdlib... don't get me started on this :) But what matters most is practical applicability, as opposed to functional purity.
Pragmatic approach is a distinctive feature of both Python and Django, and small(ish) deficiencies that come along with it are, at worst, tolerable.
Wow, woke up to all this feedback - I feel like I got an early Christmas present! Thanks everyone for reading and sharing their thoughts.
Just to clear the air, it wasn't my intention to start an ORM vs SQL debate, not even goin' there.
Dev and Mark V, you both seem to agree that changing the django ORM now would affect its ease-of-use. That's a valid argument, since a ton of people already know and use it, and it makes sense to them. I would just remind you that the Django project's own stated goals are in conflict with the design of its ORM. I also pointed out that the existing APIs could remain, since they would be a subset of any new, more powerful ones.
To DBG, I would just ask if you would rather "do it in python", why not use shelve or pickle then? The reason most people use a relational database is because it is insanely optimized for performing ad-hoc queries. If you are going to avoid using its best features, what's the point?
Have you looked at other ORMs, namely rails' active_record? Don't get me wrong, I'm not saying AR is the best orm or that everyone should switch to rails. But, some of the concepts like methods for clauses, are more or less in practice. Maybe there are other [orm] design decisions that will fit peewee's style.
Dude: believe it or not, you are not the first to ask this question! There is a somewhat entertaining thread on Reddit where the author of SQA and I talk a bit about the difference between the two ORMs. As I stated in the comments, peewee is ~1,600 SLOC compared to SQLAlchemy's 75,000. Peewee is made to be hackable, simple, and straightforward -- you can read the entire source code in one sitting. And yet for all that, I feel it is a powerful and expressive ORM.
Fun read - both the article and comments.
While raw SQL can be both concise and powerful, I try to use it sparingly because doing so can limit portability. Moving logic into Python can cause a bit of a performance hit, but it least it can run regardless of the DB platform.
Since we have to work with what django has and not replace it, it'd probably be good if django had a meta goal of consistancy which bugs could be filed under.
Bugs for all the inconsistencies mentioned here would be filed under it.
For my own wishlist I'd get rid of Managers and just have QuerySets.
Commenting has been closed, but please feel free to contact me