Generating aggregate data across generic relations

May 22, 2010 19:22 / aggregation django gfk / 7 comments

Edit: I've created a github repo for performing generic aggregation and annotation based on the code from this entry

Aggregation support was added to django's ORM in version 1.1, allowing you to generate Sums, Counts, and more without having to write any SQL. According to the docs aggregation is not supported for generic relations. This entry describes how to work around this using the .extra() method.

The state of the art

To take an example from the docs, it is possible to span relationships when performing aggregations:

>>> Store.objects.annotate(min_price=Min('books__price'), max_price=Max('books__price'))

Here we are querying the Store object and annotating the result set with two extra attributes, 'min_price' and 'max_price', which contain the minimum and maximum price of books that are sold at that store. Conversely, if we want to find the minimum and maximum book price over the entire queryset, you would write:

>>> Store.objects.aggregate(min_price=Min('books__price'), max_price=Max('books__price'))

The aggregate() method returns a dictionary as opposed to a queryset. This is an incredibly clean API!

Suppose you want to aggregate across a GFK

This is tricky. A generic foreign key is comprised of two attributes, a ContentType and a foreign key. The models that are GFKed to do not contain a reverse relationship to the GFK model by default. This is a little obscure, but basically it means that you can create a django Comment on any object since it supports generic relations but you can't go from any model to their associated comments (without creating a reverse generic relation).

So assume you have a simple weblog and would like to sort entries by which have the most comments. Unfortunately, this does not work:

class Entry(models.Model):
    ...
    comments = generic.GenericRelation(Comment) # reverse generic relation
    ...


Entry.objects.annotate(count=Count('comments')) # does not work!

There's a ticket for this, but it is marked for the 1.3 milestone. So how I solved this in the meantime was to use the extra method. Here's what it might look like:

from django.db.models import Count
from django.contrib.comments.models import Comment
from django.contrib.contenttypes.models import ContentType
from blog.models import Entry

ctype = ContentType.objects.get_for_model(Entry)
qs = Entry.objects.extra(select={
    'count': """
        SELECT COUNT(*) AS comment_count
        FROM django_comments
        WHERE
            content_type_id=%s AND
            object_pk=CAST(blog_entries.id as text)
        """
    },
    select_params=[ctype.pk],
    order_by=['-count'])

This code essentially performs a subquery for every entry returned, which calculates the number of comments associated with that entry. Two 'magical' things are happening here:

Inside the CAST() we're referring to blog_entries.id -- this is a field that is being retrieved by default, outside of the inner query.
The order_by is inside the extra() -- note how the COUNT() function is returning its result as "comment_count", but the select dictionary is keyed using "count". It is the key we specify that can be used for custom ordering, as opposed to whatever is used in the query.

Genericizing it

def generic_annotate(queryset, gfk_field, aggregate_field, aggregator=models.Sum, desc=True):
    ordering = desc and '-score' or 'score'
    content_type = ContentType.objects.get_for_model(queryset.model)

    # collect the params we'll be using
    params = (
        aggregator.name, # the function that's doing the aggregation
        aggregate_field, # the field containing the value to aggregate
        gfk_field.model._meta.db_table, # table holding gfk'd item info
        gfk_field.ct_field, # the content_type field on the GFK
        content_type.pk, # the content_type id we need to match
        gfk_field.fk_field, # the object_id field on the GFK
        queryset.model._meta.db_table, # the table and pk from the main
        queryset.model._meta.pk.name   # part of the query
    )

    queryset = queryset.extra(select={
        'score': """
            SELECT %s(%s) AS aggregate_score
            FROM %s
            WHERE
                %s_id=%s AND
                %s=%s.%s
        """ % params
    },
    order_by=[ordering])

    return queryset

>>> from misc import generic_annotate
>>> from blog.models import Entry
>>> from tagging.models import TaggedItem
>>> from django.db.models import Count
>>> qs = generic_annotate(Entry.objects.all(), TaggedItem.object, 'id', Count)
>>> qs[0].score
5L

>>> qs[1].score
4L

>>> qs[1].tags
u'databases django many-to-many python'

Note that this example works for situations in which you'd use annotate(), but it doesn't allow you to mimic aggregate(), which is useful for generating summary-type data about items in a queryset. This is not too difficult to accomplish - simply invert the logic a bit so that we generate the aggregate on the outside and the result set on the inside:

def generic_aggregate(queryset, gfk_field, aggregate_field, aggregator=models.Sum):
    content_type = ContentType.objects.get_for_model(queryset.model)

    queryset = queryset.values_list('pk') # just the pks
    sql = queryset.query.as_sql()[0] % queryset.query.as_sql()[1]

    # collect the params we'll be using
    params = (
        aggregator.name, # the function that's doing the aggregation
        aggregate_field, # the field containing the value to aggregate
        gfk_field.model._meta.db_table, # table holding gfk'd item info
        gfk_field.ct_field, # the content_type field on the GFK
        content_type.pk, # the content_type id we need to match
        gfk_field.fk_field, # the object_id field on the GFK
        sql
    )

    query = """
            SELECT %s(%s) AS aggregate_score
            FROM %s
            WHERE
                %s_id=%s AND
                %s IN (
                    %s
                )
        """ % params

    cursor = connection.cursor()
    cursor.execute(query)
    row = cursor.fetchone()

    return row[0]

>>> from misc import generic_aggregate
>>> from blog.models import Entry
>>> from tagging.models import TaggedItem
>>> from django.db.models import Count
>>> qs = generic_aggregate(Entry.objects.all(), TaggedItem.object, 'id', Count)
106L # the total number of times a tag was added to an entry

More!

Check out the project that generated this entry, django-simple-ratings -- more to come on this!
Ticket 10870
Aggregation docs
GFK docs

Comments (7)

Charles | may 31 2010, at 02:44pm

Damon - I went ahead and created a repo on github with this code (cleaned up and now with tests!). I also added support for filtering on the GFK queryset:

Commit b41d9 on django-generic-aggregation

Charles | may 31 2010, at 10:10am

Damon - that's an interesting problem and one that should probably be addressed in some fashion by the code. Additionally, what if you get a lot of spam - you'd probably want to filter on most commented where the comment "is_public". I'll see about working up a solution!

Thanks for the comments!

Damon | may 29 2010, at 12:39pm

Ooh - one more wish/request ... could one add a timeframe to the SQL ?

I am using this in terms of comments ... if I wanted to return a queryset with the most commented item in the last 24 hours first, could one do that?

I am not versed in SQL.

Thanks again - it's solved the first quandry for me.

Damon

Damon | may 29 2010, at 12:28pm

Hi Charles - this seems to work very well. Thanks!

I wondered if there was a way to add a 'secondary' sort method to the generic_aggregate function?

For example, to sort first by the "score" and then by the date ? A passable field ? I tried to add it to the extra bit but that didn't work.

Thanks again.

Damon

Charles Leifer | may 24 2010, at 12:34pm

NiKo -- I tested the SQL with sqlite and postgres, and assume it will work with MySQL. It would be nice to get away from the raw sql / unescaped elements and I will post an update when I get a chance to make it better. Thanks for reading, glad this worked for you!

NiKo | may 24 2010, at 11:33am

There you solved it, thanks!

I had hard time with counting gfk, if you want to earn badges you can answer my entry on stackoverflow ;) http://stackoverflow.com/questions/2754320/problem-trying-to-achieve-a-join-using-the-comments-contrib-in-django

Just a question though, will the sql statements in the subqueries be compatible with all rdbms supported by the Django ORM?

memo | may 24 2010, at 01:21am

This is pretty interesting! I've tried to do similiar thing a while ago, but failed. So thanks for sharing this.

Commenting has been closed.