Generating aggregate data across generic relations
Edit: I've created a github repo for performing generic aggregation and annotation based on the code from this entry
Aggregation support was added to django's ORM in version 1.1, allowing you to generate Sums, Counts, and more without having to write any SQL. According to the docs aggregation is not supported for generic relations. This entry describes how to work around this using the .extra() method.
The state of the art
To take an example from the docs, it is possible to span relationships when performing aggregations:
>>> Store.objects.annotate(min_price=Min('books__price'), max_price=Max('books__price'))
Here we are querying the Store object and annotating the result set with two extra attributes, 'min_price' and 'max_price', which contain the minimum and maximum price of books that are sold at that store. Conversely, if we want to find the minimum and maximum book price over the entire queryset, you would write:
>>> Store.objects.aggregate(min_price=Min('books__price'), max_price=Max('books__price'))
The aggregate() method returns a dictionary as opposed to a queryset. This is an incredibly clean API!
Suppose you want to aggregate across a GFK
This is tricky. A generic foreign key is comprised of two attributes, a ContentType and a foreign key. The models that are GFKed to do not contain a reverse relationship to the GFK model by default. This is a little obscure, but basically it means that you can create a django Comment on any object since it supports generic relations but you can't go from any model to their associated comments (without creating a reverse generic relation).
So assume you have a simple weblog and would like to sort entries by which have the most comments. Unfortunately, this does not work:
class Entry(models.Model):
...
comments = generic.GenericRelation(Comment) # reverse generic relation
...
Entry.objects.annotate(count=Count('comments')) # does not work!
There's a ticket for this, but it is marked for the 1.3 milestone. So how I solved this in the meantime was to use the extra method. Here's what it might look like:
from django.db.models import Count
from django.contrib.comments.models import Comment
from django.contrib.contenttypes.models import ContentType
from blog.models import Entry
ctype = ContentType.objects.get_for_model(Entry)
qs = Entry.objects.extra(select={
'count': """
SELECT COUNT(*) AS comment_count
FROM django_comments
WHERE
content_type_id=%s AND
object_pk=CAST(blog_entries.id as text)
"""
},
select_params=[ctype.pk],
order_by=['-count'])
This code essentially performs a subquery for every entry returned, which calculates the number of comments associated with that entry. Two 'magical' things are happening here:
- Inside the CAST() we're referring to blog_entries.id -- this is a field that is being retrieved by default, outside of the inner query.
- The order_by is inside the extra() -- note how the COUNT() function is returning its result as "comment_count", but the select dictionary is keyed using "count". It is the key we specify that can be used for custom ordering, as opposed to whatever is used in the query.
Genericizing it
def generic_annotate(queryset, gfk_field, aggregate_field, aggregator=models.Sum, desc=True):
ordering = desc and '-score' or 'score'
content_type = ContentType.objects.get_for_model(queryset.model)
# collect the params we'll be using
params = (
aggregator.name, # the function that's doing the aggregation
aggregate_field, # the field containing the value to aggregate
gfk_field.model._meta.db_table, # table holding gfk'd item info
gfk_field.ct_field, # the content_type field on the GFK
content_type.pk, # the content_type id we need to match
gfk_field.fk_field, # the object_id field on the GFK
queryset.model._meta.db_table, # the table and pk from the main
queryset.model._meta.pk.name # part of the query
)
queryset = queryset.extra(select={
'score': """
SELECT %s(%s) AS aggregate_score
FROM %s
WHERE
%s_id=%s AND
%s=%s.%s
""" % params
},
order_by=[ordering])
return queryset
>>> from misc import generic_annotate
>>> from blog.models import Entry
>>> from tagging.models import TaggedItem
>>> from django.db.models import Count
>>> qs = generic_annotate(Entry.objects.all(), TaggedItem.object, 'id', Count)
>>> qs[0].score
5L
>>> qs[1].score
4L
>>> qs[1].tags
u'databases django many-to-many python'
Note that this example works for situations in which you'd use annotate(), but it doesn't allow you to mimic aggregate(), which is useful for generating summary-type data about items in a queryset. This is not too difficult to accomplish - simply invert the logic a bit so that we generate the aggregate on the outside and the result set on the inside:
def generic_aggregate(queryset, gfk_field, aggregate_field, aggregator=models.Sum):
content_type = ContentType.objects.get_for_model(queryset.model)
queryset = queryset.values_list('pk') # just the pks
sql = queryset.query.as_sql()[0] % queryset.query.as_sql()[1]
# collect the params we'll be using
params = (
aggregator.name, # the function that's doing the aggregation
aggregate_field, # the field containing the value to aggregate
gfk_field.model._meta.db_table, # table holding gfk'd item info
gfk_field.ct_field, # the content_type field on the GFK
content_type.pk, # the content_type id we need to match
gfk_field.fk_field, # the object_id field on the GFK
sql
)
query = """
SELECT %s(%s) AS aggregate_score
FROM %s
WHERE
%s_id=%s AND
%s IN (
%s
)
""" % params
cursor = connection.cursor()
cursor.execute(query)
row = cursor.fetchone()
return row[0]
>>> from misc import generic_aggregate
>>> from blog.models import Entry
>>> from tagging.models import TaggedItem
>>> from django.db.models import Count
>>> qs = generic_aggregate(Entry.objects.all(), TaggedItem.object, 'id', Count)
106L # the total number of times a tag was added to an entry
More!
- Check out the project that generated this entry, django-simple-ratings -- more to come on this!
- Ticket 10870
- Aggregation docs
- GFK docs
Comments (7)
Charles | may 31 2010, at 10:10am
Damon - that's an interesting problem and one that should probably be addressed in some fashion by the code. Additionally, what if you get a lot of spam - you'd probably want to filter on most commented where the comment "is_public". I'll see about working up a solution!
Thanks for the comments!
Damon | may 29 2010, at 12:39pm
Ooh - one more wish/request ... could one add a timeframe to the SQL ?
I am using this in terms of comments ... if I wanted to return a queryset with the most commented item in the last 24 hours first, could one do that?
I am not versed in SQL.
Thanks again - it's solved the first quandry for me.
Damon
Damon | may 29 2010, at 12:28pm
Hi Charles - this seems to work very well. Thanks!
I wondered if there was a way to add a 'secondary' sort method to the generic_aggregate
function?
For example, to sort first by the "score" and then by the date ? A passable field ? I tried to add it to the extra
bit but that didn't work.
Thanks again.
Damon
Charles Leifer | may 24 2010, at 12:34pm
NiKo -- I tested the SQL with sqlite and postgres, and assume it will work with MySQL. It would be nice to get away from the raw sql / unescaped elements and I will post an update when I get a chance to make it better. Thanks for reading, glad this worked for you!
NiKo | may 24 2010, at 11:33am
There you solved it, thanks!
I had hard time with counting gfk, if you want to earn badges you can answer my entry on stackoverflow ;) http://stackoverflow.com/questions/2754320/problem-trying-to-achieve-a-join-using-the-comments-contrib-in-django
Just a question though, will the sql statements in the subqueries be compatible with all rdbms supported by the Django ORM?
memo | may 24 2010, at 01:21am
This is pretty interesting! I've tried to do similiar thing a while ago, but failed. So thanks for sharing this.
Commenting has been closed.
Charles | may 31 2010, at 02:44pm
Damon - I went ahead and created a repo on github with this code (cleaned up and now with tests!). I also added support for filtering on the GFK queryset:
Commit b41d9 on django-generic-aggregation