A Tour of Tagging Schemas: Many-to-many, Bitmaps and More

April 03, 2015 21:59 / peewee python sql / 0 comments

In this post I'll describe how to implement tagging with a relational database. What I mean by tagging are those little labels you see at the top of this blog post, which indicate how I've chosen to categorize the content. There are many ways to solve this problem, and I'll try to describe some of the more popular methods, as well as one unconventional approach using bitmaps. In each section I'll describe the database schema, try to list the benefits and drawbacks, and present example queries. I will use Peewee ORM for the example code, but hopefully these examples will easily translate to your tool-of-choice.

Many-to-many

My preferred method for implementing tagging is to create a many-to-many relationship using three tables. Having normalized data means that I can slice and dice my data pretty much any way imaginable. On my blog I have the following tables in a many-to-many relationship:

class BlogEntry(BaseModel):
    title = CharField()
    content = TextField()
    timestamp = DateTimeField()
    is_published = BooleanField()

class Tag(BaseModel):
    tag = CharField()

class BlogEntryTags(BaseModel):
    blog_entry = ForeignKeyField(BlogEntry)
    tag = ForeignKeyField(Tag)

Benefits

Normalized so tags and entries are not duplicated and the relationship between the two is stored in a dedicated table.
Because it is normalized, the sky's the limit as far as querying goes.
Efficiently query for objects associated with a given tag, or tags associated with a given object.

Drawbacks

More complex data model means most queries require at least one join.
Some queries may be a bit un-intuitive, for example retrieving blogs that are tagged with multiple tags.
Slower to INSERT or DELETE.

Example queries

We can query for blog entries matching a given tag, or conversely, query for the tags associated with a given blog entry. Here are example queries:

# Get blogs tagged with "python".
python_blogs = (BlogEntry
                .select()
                .join(BlogEntryTags)
                .join(Tag)
                .where(
                    (BlogEntry.is_published == True) &
                    (Tag.tag == 'python'))
                .order_by(BlogEntry.timestamp.desc()))

# Get the tags for a particular blog entry.
entry_tags = (Tag
              .select()
              .join(BlogEntryTags)
              .where(BlogEntryTags.blog_entry == some_entry)
              .order_by(Tag.tag))

Another neat thing is to get "popular" tags by listing the tags and the count of associated entries:

count = fn.COUNT(BlogEntryTag.id)
tags_with_counts = (Tag
                    .select(Tag, count.alias('entry_count'))
                    .join(BlogEntryTag)
                    .join(BlogEntry)
                    .where(BlogEntry.is_published == True)
                    .group_by(Tag)
                    .order_by(count.desc(), Tag.tag))

Querying multiple tags using AND and OR

Sometimes you may wish to retrieve objects that match more than one tag using either AND or OR logic. Implementing this with junction tables can sometimes be a bit confusing at first, so I thought I'd present examples of both AND and OR queries.

Here is how you might query for all entries that are tagged with python, peewee AND sqlite (the trinity!):

tags = ['python', 'peewee', 'sqlite']
awesome_entries = (BlogEntry
                   .select()
                   .join(BlogEntryTag)
                   .join(Tag)
                   .where(
                       (Tag.tag << tags) &
                       (BlogEntry.published == True))
                   .group_by(BlogEntry)
                   .having(fn.COUNT(BlogEntry.id) == len(tags)))

To query for entries tagged with either python, peewee OR sqlite, we can simply drop the HAVING clause:

tags = ['python', 'peewee', 'sqlite']
query = (BlogEntry
         .select()
         .join(BlogEntryTag)
         .join(Tag)
         .where(
             (Tag.tag << tags) &
             (BlogEntry.published == True))
         .group_by(BlogEntry))

Dead simple approach

The opposite way, in terms of complexity, is to store tags in a delimited CharField on the BlogEntry model. With this method no additional tables or joins are needed to perform queries for tags. Here is what the BlogEntry model looks like:

class BlogEntry(BaseModel):
    title = CharField()
    content = TextField()
    timestamp = DateTimeField()
    is_published = BooleanField()
    tags = CharField()

Benefits

Simple schema
Simple queries for entries matching tags
Simple implementation

Drawbacks

Queries may be very slow.
Limitations in terms of what types of queries are possible.
Difficult to do things like list the most popular tags.
Must rely on a convention for storing tags.

Example queries

Before I get into the examples, I'd like to point out that it's necessary to agree beforehand on some type of delimiter to separate the tags in the column. Since these queries will use the LIKE operator (which does partial string matching with wildcards), we need to surround each tag with a delimiter.

If we did not use pipes as delimiters and searched for BlogEntry.tags LIKE '%python%', our search might incorrectly return entries tagged with python-nugget or greenpython (not actual things).

In these examples I'll assume that tags are delimited by pipes |, and that a pipe character is also placed at the beginning and end of the tags column, so we get values like:

|tag1|tag2|tag3|
|python|peewee|
|| (no tags)

Here is an example of how we might query for all entries matching the python tag:

python_entries = (BlogEntry
                  .select()
                  .where(
                      (BlogEntry.is_published == True) &
                      BlogEntry.tags.contains('|python|')))

To return the list of tags for a given entry, we simply use string operations to clean up the data stored in the BlogEntry.tags column:

tags = some_entry.tags.strip('|').split('|')

To search for multiple tags, you can just AND- or OR-together multiple calls to Blog.tags.contains(), e.g.

awesome_entries = (BlogEntry
                   .select()
                   .where(
                       (BlogEntry.is_published == True) &
                       BlogEntry.tags.contains('|python|') &
                       BlogEntry.tags.contains('|peewee|') &
                       BlogEntry.tags.contains('|sqlite|')))

Getting crazy with bitmaps

For some scenarios, it might make sense to store tags in a bitmap. Each Tag would correspond to a power of 2, and each BlogEntry would have a tags field that stored an integer corresponding to the mask of the selected tags. For applications with a small number of tags where you're interested in calculating lots of unions and intersections, this might be a good approach! I originally intended to add this at the end of the post as a suggested alternative implementation, but I was too intrigued and decided to try my hand at implementing it. Apologies in advance if this is a little rough around the edges!

Our models will look roughly the same as the dead-simple example, except each Tag will store a special identifier column, and the BlogEntry.tags field will be an IntegerField:

class Tag(BaseModel):
    tag = CharField()
    identifier = IntegerField()

class BlogEntry(BaseModel):
    title = CharField()
    content = TextField()
    timestamp = DateTimeField()
    is_published = BooleanField()
    tags = IntegerField(index=True)

Benefits

Efficiently find entries matching one or more tags.
Efficiently find entries matching groups of tags (unions and intersections).
Low storage overhead.
Fast (?)

Drawbacks

Seems like there might be a pretty low upper-bound on the number of tags you can have. If an IntegerField stores 64-bits, then you have a max of 64 unique tags.
Complexity of code.

Example queries

First I'd like to cover how we might add tags to the table, since we need to ensure that each tag has a unique identifier that is a power of 2. The simplest way I thought of for doing this is to simply take the max identifier in the table and multiply it by 2. We need to add an additional check so that when we insert the very first tag it will have a default value of 1.

class Tag(BaseModel):
    # ... tag fields ...

    @classmethod
    def add_tag(cls, tag):
        new_tag = Tag.create(
            tag=tag,
            identifier=fn.COALESCE(
                Tag.select(fn.MAX(Tag.identifier) * 2),
                1))
        # Re-fetch the newly-created tag so the identifier
        # is populated with the value.
        return Tag.get(Tag.id == new_tag.id)

When assigning tags to a blog entry, we will need to OR-together the identifiers of the tags. Here is some example code:

class BlogEntry(BaseModel):
    # ... blog entry fields ...

    def set_tags(self, *tags):
        self.tags = 0
        for tag in tags:
            self.tags |= tag.identifier

We can use these functions to create two tags and assign them to a blog entry like this:

>>> python = Tag.add_tag('python')
>>> sqlite = Tag.add_tag('sqlite')
>>> some_entry = BlogEntry(
...     title='One weird trick to use Python with SQLite',
...     content='Developers fear him! Man creates brain-dead system...')
...
>>> some_entry.set_tags(python, sqlite)
>>> some_entry.save()

Here is how you might retrieve the entries tagged with python:

py_ident = Tag.get(Tag.tag == 'python').identifier
py_entries = (BlogEntry
              .select()
              .where(BlogEntry.tags.bin_and(py_ident) != 0))

We can also express this in a single query using a JOIN:

query = (BlogEntry
         .select()
         .join(Tag, on=(BlogEntry.tags.bin_and(Tag.identifier) != 0))
         .where(Tag.tag == 'python'))

To determine which tags are associated with a blog post, I opted for a little binary arithmetic. I'm not sure if this is the most clever way to do this, so if you have a better suggestion don't hesitate to leave a comment!

class BlogEntry(BaseModel):
    # ... blog entry fields and set_tags() method ...

    def get_tags(self):
        tag_val = self.tags
        i = 1
        identifiers = []
        while tag_val != 0:
            if tag_val & 1:
                identifiers.append(i)
            i <<= 1  # Increase `i` to the next power of 2.
            tag_val >>= 1  # Pop the right-most bit off of tagval.
        return Tag.select().where(Tag.identifier.in_(identifiers))

A cool feature of this design is that we can really easily find blog entries that match multiple tags. For instance let's say we want blog entries tagged with python, peewee and sqlite. We can write the following query:

# Subquery to calculate sum of identifiers. Note that this is not executed here!
tsum = (Tag
        .select(fn.SUM(Tag.identifier))
        .where(Tag.tag << ['python', 'peewee', 'flask'])
        .alias('tsum'))  # Alias we will refer to in BlogEntry query.

# Entries tagged with 'python', 'peewee' AND 'flask'.
query = (BlogEntry
         .select(BlogEntry, tsum)  # Note we select BlogEntry and tsum.
         .where(BlogEntry.tags.bin_and(SQL('tsum')) == SQL('tsum')))

Similarly, to query blog entries tagged with either python, peewee, or sqlite, we can simply check for a non-zero result of the bin_and() call:

# Same as previous example.
tsum = (Tag
        .select(fn.SUM(Tag.identifier))
        .where(Tag.tag << ['python', 'peewee', 'flask'])
        .alias('tsum'))  # Alias we will refer to in BlogEntry query.

# Entries tagged with 'python', 'peewee' OR 'flask'.
query = (BlogEntry
         .select(BlogEntry, tsum)  # Note we select BlogEntry and tsum.
         .where(BlogEntry.tags.bin_and(SQL('tsum')) != 0))

To be honest, I'm kind of surprised how well this worked! I think this is a really fun way of implementing tags.

Tagging multiple content-types

Difficulties arise when we have multiple content types we wish to tag, for instance if I wished to add a photos table and allow photos to be tagged as well. The rest of this post will describe ways to tag multiple content-types.

We'll assume the following models for the rest of the post:

class BlogEntry(BaseModel):
    title = CharField()
    content = TextField()
    timestamp = DateTimeField()
    is_published = BooleanField()

class Photo(BaseModel):
    filename = CharField()
    description = TextField()

class Tag(BaseModel):
    tag = CharField()

Fat junction table

As shown in the first section, one way to related blog entries to tags is with a many-to-many junction table. We can allow tags to be related to photos by adding an additional foreign key to the junction table, hence fat junction table.

class TagJunction(BaseModel):
    blog_entry = ForeignKeyField(BlogEntry, null=True)
    photo = ForeignKeyField(Photo, null=True)
    tag = ForeignKeyField(Tag)

Benefits

As stated earlier, having normalized data gives us a lot of flexibility in terms of how we can slice and dice the data.

Drawbacks

May have to add additional joins to retrieve all available content-types.
Adding new content-type foreign keys may require changes to queries elsewhere in the codebase.
Slower to INSERT or DELETE.

Example queries

Get all blog entries and photos associated with the python tag:

query = (TagJunction
         .select(TagJunction, BlogEntry, Photo)
         .join(BlogEntry, JOIN_LEFT_OUTER)
         .switch(TagJunction)
         .join(Photo, JOIN_LEFT_OUTER)
         .switch(TagJunction)
         .join(Tag)
         .where(Tag.tag == 'python')
         .order_by(BlogEntry.title, Photo.filename))

for tag_junction in query:
    if tag_junction.blog_entry.id is not None:
        # We have an entry.
        print 'Entry: %s' % tag_junction.blog_entry.title
    elif tag_junction.photo.id is not None:
        # We have an photo.
        print 'Photo: %s' % tag_junction.photo.filename

Get all tags associated with a blog entry (you can use a similar query for tags associated with photo):

entry_tags = (Tag
              .select()
              .join(TagJunction)
              .where(TagJunction.blog_entry == some_entry)
              .order_by(Tag.tag))

To retrieve the list of tags and the counts of associated entries and photos, you can write:

entry_count = fn.COUNT(BlogEntry.id)
photo_count = fn.COUNT(Photo.id)
tag_count = (Tag
             .select(
                 Tag.tag,
                 entry_count.alias('entry_count'),
                 photo_count.alias('photo_count'))
             .join(TagJunction)
             .join(BlogEntry, JOIN_LEFT_OUTER)
             .switch(TagJunction)
             .join(Photo, JOIN_LEFT_OUTER)
             .group_by(Tag.tag)
             .order_by(Tag.tag))

for tag in tag_count:
    print '%s, %s entries, %s photos' % (
        tag.tag,
        tag.entry_count,
        tag.photo_count)

Generic Foreign Keys

Generic foreign keys (GFKs) are a concept borrowed from the Django framework. The basic premise is that instead of a single foreign key relating to an object's primary key, you have two columns. One column stores the related object's primary key, as a normal foreign key might, while the other column stores some identifier signifying the type of object being stored. I've written several posts about GFKs in Django if you'd like to read more.

GFKs can apply to tagging applications by providing a means of associating a tag with any other model in the application. Tagging is frequently used as the canonical use-case for GFKs, though they certainly can apply to other situations.

Peewee has a GFK extension module which we will use for this example. For the schema we will add two columns to the Tag model for storing the related object type and id, and a special GFKField for easy access to the related object.

class Tag(BaseModel):
    tag = CharField()
    object_type = CharField()
    object_id = IntegerField()
    object = GFKField('object_type', 'object_id')

    class Meta:
        indexes = (
            (('object_type', 'object_id'), False),
        )

class BlogEntry(BaseModel):
    # Same fields as original model...
    tags = ReverseGFK(Tag, 'object_type', 'object_id')

class Photo(BaseModel):
    # Same fields as original model...
    tags = ReverseGFK(Tag, 'object_type', 'object_id')

Benefits

Simple API for common use-cases.

Drawbacks

Inefficient to query multiple content-types, very easy to accidently incur O(n) query performance.
Denormalized, so foreign key constraints are not enforced.
Requires an internal registry of models for mapping content-type strings to the appropriate model class.
Seen by many as a "hack".

Example queries

Get all blog entries and photos associated with the python tag:

query = (Tag
         .select(Tag, BlogEntry, Photo)
         .join(
             BlogEntry,
             JOIN_LEFT_OUTER,
             on=(
                 (Tag.object_id == BlogEntry.id) &
                 (Tag.object_type == 'blogentry')))
         .switch(Tag)
         .join(
             Photo,
             JOIN_LEFT_OUTER,
             on=(
                 (Tag.object_id == Photo.id) &
                 (Tag.object_type == 'photo')))
         .where(Tag.tag == 'python')
         .order_by(BlogEntry.title, Photo.filename))

for tag in query:
    if tag.blogentry.id is not None:
        # We have an entry.
        print 'Entry: %s' % tag.blogentry.title
    elif tag.photo.id is not None:
        # We have an photo.
        print 'Photo: %s' % tag.photo.filename

Note that this is just about as cumbersome as the many-to-many example. This is because we're trying to be as efficient as possible, grabbing the entries and photos in one go. Of course it would be possible to simplify this by incurring O(n) queries:

query = Tag.select().where(Tag.tag == 'python')

for tag_junction in query:
    if tag.object_type == 'blogentry':
        print 'Entry: %s' % tag.object.title
    elif tag.object_type == 'photo':
        print 'Photo: %s' % tag.object.filename

Get all tags associated with a blog entry (similar query for tags associated with photo):

# We can simply use the `ReverseGFK` to retrieve all tags for an object.
for tag in some_entry.tags:
    print tag.tag

Dead-simple approach (revisited)

An additional benefit to the "dead simple" approach described earlier in this post is that it will work with multiple content-types just as easily as it did when we were only tagging blog entries. The drawback remains that it is difficult to make any determinations about tag usage. For instance, how would you calculate the most popular tags? But depending on your application, you may not need this type of information.

Other ideas

These are certainly not the only ways to tag objects with Peewee. Here is a list of some other ideas:

If you are using Postgres, you could store tags using an ArrayField or even using a JSONField.
You can try using a full-text search index to speed things up when storing tags in a flat CharField, though your mileage may vary. SQLite, MySQL and Postgres all have support for optimized full-text search in one form or another.
What other ideas do you have? Please feel free to leave a comment!

Personally, my favorite approach is to use a many-to-many relationship for storing objects and tags. By having normalized data, I know that I will always be able to write a SQL query to express the data I'm interested in. I don't create blog entries very often and never delete them, so the performance impact is minimal. I also cache the parts of my site that query tags, so the additional joins aren't a problem in practice.

Thanks for taking the time to read this post, I hope you found it interesting. If you have any questions or comments, please leave a comment using the form below or drop me a message.

Comments (0)

Commenting has been closed.

A Tour of Tagging Schemas: Many-to-many, Bitmaps and More

Many-to-many

Benefits

Drawbacks

Example queries

Querying multiple tags using AND and OR

Dead simple approach

Benefits

Drawbacks

Example queries

Getting crazy with bitmaps

Benefits

Drawbacks

Example queries

Tagging multiple content-types

Fat junction table

Benefits

Drawbacks

Example queries

Generic Foreign Keys

Benefits

Drawbacks

Example queries

Dead-simple approach (revisited)

Other ideas

Links

Comments (0)