May 03, 2012 00:05 / 0 comments / django gfk

In this post I want to discuss how to work around some of the shortcomings of django's ORM when dealing with Generic Foreign Keys (GFKs).

At the end of the post I'll show how to work around django's failure to correctly CAST when the generic foreign key's object_id column is of a different type than the primary key of the objects it may point to.

A quick primer on content-types and GFKs

If the Generic Foreign Key did not exist, it would be necessary to invent it

Thanks to the content-types framework, part of django.contrib, we do not have to do any inventing, however. The content-types framework is an app that is responsible for mapping python models to the database layer -- a dirty job, but it makes a number of other things easier to implement. Content-types are used to provide granular permissions via the auth/permissions framework and, notably, they have been used to implement GFKs.

A GFK is simply a foreign key to a content-type and an additional column to store a related primary key. It is not really a foreign key at all in the sense of it being an actual database constraint. Nor is it a foreign key in the same sense as django's db.models.ForeignKey, because the ORM barely supports querying against them and offers no "on_delete" support (which often results in orphaned rows or messy signal handlers).
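To make this concrete, here is roughly what happens under the hood when you access a GFK -- a sketch of the two lookups GenericForeignKey performs for you, assuming rating is an instance of a model with content_type and object_id fields:

>>> from django.contrib.contenttypes.models import ContentType
>>> ct = ContentType.objects.get_for_id(rating.content_type_id)
>>> obj = ct.get_object_for_this_type(pk=rating.object_id)

Resolving the related object takes a separate query -- there is no JOIN to be had, which is part of why the ORM's support is so thin.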

You cannot perform annotations or aggregations using the new querying APIs (actually, you can with a little work), and to traverse a reverse generic relation (akin to the related_name attribute of a ForeignKey), you need to add a custom manager to your model -- not very DRY.


Why use GFKs at all?

Whether you like it or not, the GFK can be a useful tool. It is often purported to make apps "reusable" -- by not specifying a database-level constraint, you leave implementers some freedom in choosing their modelling. Consider the django.contrib.comments app -- it uses a GFK so that the same comments app can be used on any model, even on itself. The comments are all stored in a single database table and are all accessible using the same API and via the same admin interface. GFKs have also been used for some silly hacks.

Getting to the interesting stuff

Coming around to the point of this post, I think we can all agree that GFKs are at best a gross hack that makes some tasks easier. Yet as gross as they are, they are still implemented at the database level, so even though the "stock" ORM does not offer many niceties in the way of dealing with GFKs we can easily use some of django's lesser-known features to build some nice APIs on top. The rest of the post will be in a "problem/solution" format. I tested the example code against django 1.2, 1.3.1 and current master (ddfc7c253019).

For reference, I'll pretend we're talking about a generic rating model that looks something like this:

from django.contrib.contenttypes import generic
from django.contrib.contenttypes.models import ContentType
from django.db import models

class Rating(models.Model):
    rating = models.IntegerField()
    object_id = models.IntegerField()
    content_type = models.ForeignKey(ContentType)
    content_object = generic.GenericForeignKey(ct_field='content_type', fk_field='object_id')

class Food(models.Model):
    name = models.CharField(max_length=50)
    ratings = generic.GenericRelation(Rating) # reverse generic relation
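With these models in place, the basics work nicely -- here's a quick sketch (apple is just a stand-in instance):

>>> apple = Food.objects.create(name='apple')
>>> Rating.objects.create(content_object=apple, rating=5)
>>> apple.ratings.count() # traverse the reverse generic relation
1

It's once you start filtering, annotating and aggregating across the relation that things get hairy.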

Filtering ratings to a particular subset of rated items

This is not as easy as it sounds at first. Let's say we want to get ratings on foods that start with the letter "A". You might try the following:

>>> Rating.objects.filter(content_object__name__startswith="a")

The problem is, maybe not all "content objects" have a "name" column, so how should django know that you mean just foods? There is no API you can use to "hint" at this, though you could try:

>>> Rating.objects.filter(content_type=food_ctype, content_object__name__startswith="a")

You'll still get the FieldError, of course.

The more experienced django programmer's reflex might be to use a subquery:

>>> a_foods = Food.objects.filter(name__startswith='a')
>>> Rating.objects.filter(content_object__in=a_foods)

And eventually end up at:

>>> Rating.objects.filter(
        content_type=food_ctype,
        object_id__in=a_foods.values('id')
    )

Yes, this will work, provided Rating.object_id is of the same type as Food.id (both are integers, so we're good). Note that we are no longer querying against the fake content_object field. The SQL can end up looking pretty funky, though, since that subquery is not converted into a database-level JOIN. The generated SQL looks something like:

SELECT "app_rating"."id", "app_rating"."rating", "app_rating"."object_id", "app_rating"."content_type_id" 
FROM "app_rating" 
WHERE (
    "app_rating"."object_id" IN (
        SELECT U0."id" FROM "app_food" U0 
        WHERE U0."name"::text LIKE 'a%' ) AND 
    "app_rating"."content_type_id" = 7
);

I populated the database with 30,000 foods, 10,000 for each letter "a", "b", "c", and a rating for each food.

The fact is, though, that we can express this using a JOIN:

SELECT "app_rating"."id", "app_rating"."rating", "app_rating"."object_id", "app_rating"."content_type_id" 
FROM "app_rating"
INNER JOIN "app_food"
    ON "app_rating"."object_id" = "app_food"."id"
WHERE
    "app_rating"."content_type_id" = 7 AND
    "app_food"."name"::text LIKE 'a%'

Somewhat to my surprise, the subquery can actually be more performant with postgresql's query planner, which can optimize the subquery into a hash semi join. Conventional wisdom states that joins are better than subqueries, because the query planner can make optimizations when it knows what it is joining on, yet it seems, at least with postgres, that either way can yield good results.

Subquery

 Hash Semi Join  (cost=662.00..1525.00 rows=10000 width=16) (actual time=13.339..25.814 rows=10000 loops=1)
   Hash Cond: (app_rating.object_id = u0.id)
   ->  Seq Scan on app_rating  (cost=0.00..538.00 rows=30000 width=16) (actual time=0.017..4.416 rows=30000 loops=1)
         Filter: (content_type_id = 7)
   ->  Hash  (cost=537.00..537.00 rows=10000 width=4) (actual time=13.303..13.303 rows=10000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 352kB
         ->  Seq Scan on app_food u0  (cost=0.00..537.00 rows=10000 width=4) (actual time=0.011..9.652 rows=10000 loops=1)
               Filter: ((name)::text ~~ 'a%'::text)
 Total runtime: 26.216 ms
(9 rows)

Join

 Hash Join  (cost=662.00..1750.00 rows=10000 width=16) (actual time=14.001..30.933 rows=10000 loops=1)
   Hash Cond: (app_rating.object_id = app_food.id)
   ->  Seq Scan on app_rating  (cost=0.00..538.00 rows=30000 width=16) (actual time=0.016..5.490 rows=30000 loops=1)
         Filter: (content_type_id = 7)
   ->  Hash  (cost=537.00..537.00 rows=10000 width=4) (actual time=13.964..13.964 rows=10000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 352kB
         ->  Seq Scan on app_food  (cost=0.00..537.00 rows=10000 width=4) (actual time=0.011..10.382 rows=10000 loops=1)
               Filter: ((name)::text ~~ 'a%'::text)
 Total runtime: 31.576 ms
(9 rows)

The problem with examples like these is that they get taken as a rule of thumb -- there may be instances where the subquery is a fine option, and others where it will need to be expressed as a JOIN (particularly if you're using an RDBMS that does a lousy job with subqueries). At any rate, I hope this shows that you can approach this type of query either way -- via the ORM w/a subquery, or via SQL using a JOIN.
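And if you do want the JOIN without leaving python, raw() (available since django 1.2) will run it and hand back Rating instances -- a sketch, assuming the same app label as in the SQL above:

>>> ratings = Rating.objects.raw("""
    SELECT "app_rating".* FROM "app_rating"
    INNER JOIN "app_food" ON "app_rating"."object_id" = "app_food"."id"
    WHERE "app_rating"."content_type_id" = %s AND "app_food"."name" LIKE %s
""", [food_ctype.id, 'a%'])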

Annotating records, or getting the highest-rated items

Suppose you want to list all the foods ordered by rating. Using django's annotate() method, it should be easy:

>>> Food.objects.annotate(avg_score=Avg('ratings__rating')) # does not work correctly

The generated query ends up looking like:

SELECT "app_food"."id", "app_food"."name", AVG("app_rating"."rating") AS "avg_score"
FROM "app_food" 
LEFT OUTER JOIN "app_rating"
    ON ("app_food"."id" = "app_rating"."object_id")
GROUP BY "app_food"."id", "app_food"."name"

The missing piece is the additional filter on the content-type -- the JOIN is correct, but if there are any ratings on other content-types that have object_ids that collide with food ids, the results will be off.

What happens if we try to get clever and query on the content type explicitly? The generated SQL is a bit odd, but the results will be correct:

>>> ctype = ContentType.objects.get_for_model(Food)
>>> foods = Food.objects.filter(ratings__content_type=ctype)
>>> foods = foods.annotate(avg_score=Avg('ratings__rating'))
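From there, the original goal -- foods ordered by rating -- is just an order_by on the annotation alias:

>>> top_foods = foods.order_by('-avg_score')[:10] # ten best-rated foods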

This generates the following query, which is correct though it contains a duplicate where clause:

SELECT "app_food"."id", "app_food"."name", AVG("app_rating"."rating") AS "avg_score"
FROM "app_food"
LEFT OUTER JOIN "app_rating"
    ON ("app_food"."id" = "app_rating"."object_id")
WHERE (
    "app_rating"."content_type_id" = 7  AND 
    "app_rating"."content_type_id" = 7 
) GROUP BY "app_food"."id", "app_food"."name"

As an alternative, you can express this as a correlated subquery using the queryset's extra() method. The implementation is fairly straightforward, if a little unconventional, since it requires writing some SQL. The subquery is a little different this time in that it references the row from the outer query. Here is how you do this in django:

>>> extra_select = """
    SELECT Avg("rating") AS aggregate_score
    FROM "app_rating"
    WHERE
        "content_type_id"=%s AND
        "object_id"="app_food"."id"
"""
>>> Food.objects.all().extra(
    select={'avg_score': extra_select},
    select_params=[ctype.id],
    order_by=['-avg_score'], # order on the extra select alias
)

Taking a look at the entire query, we can see how django ties in the extra select query, as well as how our reference to "app_food"."id" maps to the outer query:

SELECT (
    SELECT Avg("rating") AS aggregate_score
    FROM "app_rating"
    WHERE
        "content_type_id"=7 AND
        "object_id"="app_food"."id"
) AS "avg_score", "app_food"."id", "app_food"."name"
FROM "app_food"
ORDER BY "avg_score" DESC

In this case, the JOIN performed much better than the subquery. Here are the two query plans:

Join

 Sort  (cost=8595.80..8670.80 rows=30000 width=13) (actual time=135.585..140.832 rows=30000 loops=1)
   Sort Key: (avg(app_rating.rating))
   Sort Method: external merge  Disk: 728kB
   ->  GroupAggregate  (cost=5175.40..5850.40 rows=30000 width=13) (actual time=58.430..85.801 rows=30000 loops=1)
         ->  Sort  (cost=5175.40..5250.40 rows=30000 width=13) (actual time=58.413..63.321 rows=30000 loops=1)
               Sort Key: app_food.id, app_food.name
               Sort Method: external merge  Disk: 760kB
               ->  Hash Join  (cost=984.00..2430.00 rows=30000 width=13) (actual time=14.967..38.724 rows=30000 loops=1)
                     Hash Cond: (app_rating.object_id = app_food.id)
                     ->  Seq Scan on app_rating  (cost=0.00..538.00 rows=30000 width=8) (actual time=0.019..5.853 rows=30000 loops=1)
                           Filter: (content_type_id = 7)
                     ->  Hash  (cost=462.00..462.00 rows=30000 width=9) (actual time=14.922..14.922 rows=30000 loops=1)
                           Buckets: 4096  Batches: 2  Memory Usage: 615kB
                           ->  Seq Scan on app_food  (cost=0.00..462.00 rows=30000 width=9) (actual time=0.011..5.447 rows=30000 loops=1)
 Total runtime: 142.461 ms
(15 rows)

Subquery

 Sort  (cost=18393657.40..18393732.40 rows=30000 width=9) (actual time=108387.761..108395.111 rows=30000 loops=1)
   Sort Key: ((SubPlan 1))
   Sort Method: external merge  Disk: 816kB
   ->  Seq Scan on app_food  (cost=0.00..18390912.00 rows=30000 width=9) (actual time=9.570..108281.379 rows=30000 loops=1)
         SubPlan 1
           ->  Aggregate  (cost=613.01..613.02 rows=1 width=4) (actual time=3.607..3.607 rows=1 loops=30000)
                 ->  Seq Scan on app_rating  (cost=0.00..613.00 rows=1 width=4) (actual time=1.804..3.602 rows=1 loops=30000)
                       Filter: ((content_type_id = 7) AND (object_id = app_food.id))
 Total runtime: 108396.599 ms
(9 rows)

Holy shit, that subquery is killing us -- about 750x slower! Granted, I don't have my local postgres tuned to do large sorts in memory, but things are clearly not as efficient. This is yet another example where your mileage may vary depending on the size of your datasets, the presence of indexes, the query plans generated, and the amount of working memory available. In other words, check up on your query plans -- you may be surprised by the results.

Aggregating records to generate a total score

The last item I'll take a look at is aggregating records. This is somewhat the inverse of annotation -- rather than generating a score for every record based on some condition, I'm generating an aggregate across all records, expressed as a single scalar value. What if we wanted to know the average rating across all foods starting with "a"? Based on what we've seen so far, we could probably guess that this approach may not work exactly right:

>>> Food.objects.filter(name__startswith='a').aggregate(avg_score=Avg('ratings__rating'))

And sure enough, looking at the generated SQL the query is again missing the filter on content-type:

SELECT AVG("app_rating"."rating") AS "avg_score"
FROM "app_food"
LEFT OUTER JOIN "app_rating"
    ON ("app_food"."id" = "app_rating"."object_id")
WHERE "app_food"."name"::text LIKE 'a%'

So we'll do the same thing as above for annotate, which is to explicitly query the content_type exposed by the GenericRelation:

>>> ctype = ContentType.objects.get_for_model(Food)
>>> avg = Food.objects.filter(
    name__startswith='a',
    ratings__content_type=ctype,
).aggregate(avg_score=Avg('ratings__rating'))

Looking at the generated SQL the filter on content-type is present (although duplicated):

SELECT AVG("app_rating"."rating") AS "avg_score"
FROM "app_food"
LEFT OUTER JOIN "app_rating"
    ON ("app_food"."id" = "app_rating"."object_id")
WHERE (
    "app_food"."name"::text LIKE 'a%'  AND (
        "app_rating"."content_type_id" = 7  AND 
        "app_rating"."content_type_id" = 7
    ))

Again, we can express this as a subquery instead of a JOIN. The inner query is fairly straightforward though not easily expressed by the django ORM. What we end up with, though, would look something like this:

SELECT Avg("rating") AS aggregate_score
FROM "app_rating"
WHERE
    "content_type_id"=7 AND
    "object_id" IN (
        SELECT U0."id" FROM "app_food" U0 WHERE U0."name"::text LIKE E'a%'
    )
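The ORM won't generate that exact query, but a plain cursor will happily run it -- a sketch, with the content-type id parameterized rather than hard-coded:

>>> from django.db import connection
>>> cursor = connection.cursor()
>>> cursor.execute("""
    SELECT Avg("rating") FROM "app_rating"
    WHERE "content_type_id" = %s AND "object_id" IN (
        SELECT "id" FROM "app_food" WHERE "name" LIKE %s
    )
""", [ctype.id, 'a%'])
>>> avg_score = cursor.fetchone()[0]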

How do these different methods compare? They are actually nearly identical. The only difference is that the use of the JOIN causes the query planner to use a Hash Join, which is a bit costlier than the Hash Semi Join used by the subquery.

Join

 Aggregate  (cost=1775.00..1775.01 rows=1 width=4) (actual time=34.389..34.389 rows=1 loops=1)
   ->  Hash Join  (cost=662.00..1750.00 rows=10000 width=4) (actual time=13.804..32.909 rows=10000 loops=1)
         Hash Cond: (app_rating.object_id = app_food.id)
         ->  Seq Scan on app_rating  (cost=0.00..538.00 rows=30000 width=8) (actual time=0.017..5.986 rows=30000 loops=1)
               Filter: (content_type_id = 7)
         ->  Hash  (cost=537.00..537.00 rows=10000 width=4) (actual time=13.765..13.765 rows=10000 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 352kB
               ->  Seq Scan on app_food  (cost=0.00..537.00 rows=10000 width=4) (actual time=0.011..10.198 rows=10000 loops=1)
                     Filter: ((name)::text ~~ 'a%'::text)
 Total runtime: 34.449 ms
(10 rows)

Subquery

 Aggregate  (cost=1550.00..1550.01 rows=1 width=4) (actual time=32.597..32.598 rows=1 loops=1)
   ->  Hash Semi Join  (cost=662.00..1525.00 rows=10000 width=4) (actual time=14.598..31.128 rows=10000 loops=1)
         Hash Cond: (app_rating.object_id = u0.id)
         ->  Seq Scan on app_rating  (cost=0.00..538.00 rows=30000 width=8) (actual time=0.016..5.988 rows=30000 loops=1)
               Filter: (content_type_id = 7)
         ->  Hash  (cost=537.00..537.00 rows=10000 width=4) (actual time=14.557..14.557 rows=10000 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 352kB
               ->  Seq Scan on app_food u0  (cost=0.00..537.00 rows=10000 width=4) (actual time=0.011..10.873 rows=10000 loops=1)
                     Filter: ((name)::text ~~ 'a%'::text)
 Total runtime: 32.659 ms
(10 rows)

When the above examples fail, and how you can fix it

Note: if you're using sqlite, this will not actually be a problem for you

What happens when your model, for the sake of being SUPER generic, uses a TextField to represent the "object_id" portion of the GFK? In fact, django's own comments app does this. So let's rewrite our ratings model:

class TextRating(models.Model):
    rating = models.IntegerField()
    object_id = models.TextField()
    content_type = models.ForeignKey(ContentType)
    content_object = generic.GenericForeignKey(ct_field='content_type', fk_field='object_id')

If you're using any of the three methods I described above, get ready to see a lot of these:

DatabaseError: operator does not exist: text = integer

The reason for this is that, although django is in many cases able to generate some pretty good SQL, it does not handle casting the object_id correctly:

SELECT "app_textrating"."id", "app_textrating"."object_id", "app_textrating"."content_type_id" 
FROM "app_textrating" 
WHERE (
    "app_textrating"."object_id" IN (
        SELECT U0."id" 
        FROM "app_food" U0 
        WHERE U0."name"::text LIKE 'a%' 
    ) AND 
    "app_textrating"."content_type_id" = 7
)

Django does not do this cast because it doesn't know that we are querying against a table where the object_id is going to be an integer -- it might be able to infer the cast from the fact that we're restricting the queryset to a particular content-type, whose model has an integer primary key, but alas... What we want to see is this:

WHERE (
    CAST("app_textrating"."object_id" AS integer) IN (
        ...
    )
)

This is a problem for any of the following queries:

Filtering a set of ratings using object_id__in or similar:

TextRating.objects.filter(
    content_type=food_ctype,
    object_id__in=a_foods.values('id') # <-- missing CAST
)

Annotating records using django's annotate() api:

>>> base = Food.objects.filter(text_ratings__content_type=ctype)
>>> foods = base.annotate(avg_score=Avg('text_ratings__rating'))

Aggregating records using django's aggregate() api:

>>> base = Food.objects.filter(text_ratings__content_type=ctype)
>>> foods = base.aggregate(avg_score=Avg('text_ratings__rating'))

The reason the last two do not work is that django sets up a JOIN for us between Food.id and TextRating.object_id, and the database engine will complain that we need to CAST before doing the JOIN.

The workaround I came up with is pretty nasty and involves some abuse of django's extra(). If you'd like to take a look, the code is available on github:

https://github.com/coleifer/django-generic-aggregation/blob/master/generic_aggregation/utils.py
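The heart of the trick can be sketched in a few lines -- inject the CAST yourself with extra() (a simplified version; the library handles the general case):

>>> TextRating.objects.filter(content_type=food_ctype).extra(
    where=['CAST("object_id" AS integer) IN (SELECT "id" FROM "app_food" WHERE "name" LIKE %s)'],
    params=['a%'],
)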

You can safely call the three methods presented in the library and, in the event the django filter would fail, it will fall back to a method that returns the correct results. The fallbacks are possibly less efficient, since the queries generated differ from django's, using subqueries instead of JOINs.

If you're interested in reading more, check out the docs.

Reading more

Thanks for reading, I hope the post was informative. Feel free to leave any comments or suggestions below, and if you've got any stories of wacky things you've done using GFKs please share!
