Powerful autocomplete with Redis in under 200 lines of Python

July 06, 2012 16:29 / autocomplete python redis / 0 comments

Update

Redis-completion is now deprecated. The autocomplete functionality, along with a number of other features, have been integrated into a new project walrus. Check out the walrus blog announcement for more details.

Original post

In this post I'll present how I built a (reasonably) powerful autocomplete engine with Redis and python. For those who are not familiar with Redis, it is a fast, in-memory, single-threaded database that is capable of storing structured data (lists, hashes, sets, sorted sets). I chose Redis for this particular project because its sorted set data type, which is a good fit for autocomplete. The engine I'll describe relies heavily on Redis' sorted sets and its set operations, but can easily be translated to a pure-python solution (links at bottom of post).

To be actually useful, the engine needs to support a given set of features and guarantee a reasonable query complexity. Redis' sorted set operations are generally on the order of O(log(N)) (the underlying implementation uses skip lists), so we can satisfy the latter requirement. For the first requirement, here is my list -- the actual features themselves were suggested by the Quora typeahead search puzzle:

Search for partial phrases, e.g. "pyt co" might yield python code and configuring python
Store objects with a unique ID and optional type information
Query can have any number of "boosting factors" which act on types or ids

In addition to these requirements, I've also added:

Store and retrieve rich data alongside results (this allows us to avoid hitting the db)
Specify any number of mappers and filters which transform data coming from the data store

A simple schema

The schema I used stores autocomplete information in a series of sorted sets. Rich data is stored in a separate hash-table. An excellent description of the schema can be found on Pat Shaughnessy's blog, and this post is mostly building on his.

The schema looks like this:

Redis schema

To add an item to the index:

Store the full title of the item in a hash, keyed by item's ID (HSET)
Store the associated data for the item in a hash, keyed by the item's ID (HSET)
Break up the item's title into individual words, then break up each word into a series of partial strings. These partial strings serve as the keys to sorted sets. Calculate a score for the item (to ensure lexical ordering) and store the item's ID in the set. (ZADD)

Digging a little deeper into steps 1-3, let's assume we are storing a document titled "python code" with a unique id of "123". We will execute the following Redis commands:

HSET title.key 123 "python code" (step 1)
HSET data.key 123 <item data> (step 2)
ZADD search.key.py <score> 123
ZADD search.key.pyt <score> 123
ZADD search.key.pyth <score> 123
ZADD search.key.pytho <score> 123
ZADD search.key.python <score> 123
ZADD search.key.co <score> 123
ZADD search.key.cod <score> 123
ZADD search.key.code <score> 123

Here is how I have implemented it in redis-completion. For simplicity, the title and data are optional parameters, though in practice I have always specified them. Additionally an object type may be specified, which comes into play later when I'll talk about boosting.

def store(self, obj_id, title=None, data=None, obj_type=None):
    pipe = self.client.pipeline()

    if title is None:
        title = obj_id
    if data is None:
        data = title

    title_score = self.score_key(self.create_key(title))

    obj_id = self.kcombine(obj_id, obj_type or '')

    pipe.hset(self.data_key, obj_id, data)
    pipe.hset(self.title_key, obj_id, title)

    for word in self.clean_phrase(title):
        for partial_key in self.autocomplete_keys(word):
            pipe.zadd(self.search_key(partial_key), obj_id, title_score)

    pipe.execute()

To reiterate, we're storing the rich data (e.g. the full title and some associated data) in hashes keyed by a unique ID for the given object. Then, we're iterating through all the partial strings for the full title. Each of these partial strings will be used as a key for a sorted set. The sorted set stores the unique ID and a score for the title that will ensure results pulled from the set are in alphabetical order.

One interesting part here is scoring the item to ensure lexical ordering. The way I did that was to convert the title from a string to a base-27 number (1 for each letter, 1 for whitespace), which is what "score_key" does.

Searching the index

Searching for a single word is very easy, since the list of IDs is stored in a sorted set where the key is each partial phrase. The workflow looks like this:

Convert the search phrase to a key
Attempt to get a list of IDs at the given key (ZRANGE)
For each ID, return the associated data (HGET)

Where things get interesting is when we are searching for multiple partial phrases, for example "pyt" and "co". This is where our schema shines. We can use a special redis command ZINTERSTORE which will store the intersection of multiple sorted sets in a new sorted set. From the redis docs, the complexity of this operation is on the order of O(N*log(N)):

Time complexity: O(NK)+O(Mlog(M)) worst case with N being the smallest input sorted set, K being the number of input sorted sets and M being the number of elements in the resulting sorted set.

We will take each phrase, for example "pyt" and "co", and store the intersection of those two sets in a new sorted set. Redis handles all of this automatically. Additionally, we can set an expiry on this new sorted set so that we ensure our search index doesn't grow out of control, but popular searches will be automatically cached. Pretty sweet.

Below is a simplified version of the search code. It takes an incoming phrase, which may be multiple words, and splits it into groups of lower-cased words. If only a single word is searched for, we can simply look at that one key, but if multiple words are searched we will store the intersection of those sets in a new key and search that:

def search(self, phrase):
    # split incoming phrase into a list of lower-cased strings
    cleaned = self.clean_phrase(phrase)
    if not cleaned:
        # empty search, return no results
        return []

    if len(cleaned) == 1:
        new_key = self.search_key(cleaned[0])
    else:
        new_key = self.get_cache_key(cleaned)
        if not self.client.exists(new_key):
            self.client.zinterstore(new_key, map(self.search_key, cleaned))
            self.client.expire(new_key, self.cache_timeout)

    for raw_id in self.client.zrange(new_key, 0, -1):
        yield self.client.hget(self.data_key, raw_id)

Adding mappings, filters and boosting

Things get interesting when we add a few capabilities on top of this basic search function. We'll group these into 3 categories:

mappings: act on the data stored in the index, for example converting JSON to python objects
filters: act on the mapped data. Data is passed into the filter functions, and if it passes all filter functions, is returned, otherwise it is discarded.
boosts: allow certain object ids or object types to have their score "boosted", pushing them towards the front of the results (unfortunately O(n)).

All of these things can be implemented in the search function. Mappers and filters are easy...we simply take the data retrieved from the hash and transform it using the various mapper functions. After this, we apply the filters in order and in the event the filter returns False we discard the result. The code might look something like this:

for raw_id in self.client.zrange(new_key, 0, -1):
    raw_data = self.client.hget(self.data_key, raw_id)

    if mappers:
        for m in mappers:
            raw_data = m(raw_data)

    if filters:
        passes = True
        for f in filters:
            if not f(raw_data):
                passes = False
                break

        if not passes:
            continue

    yield raw_data

Implementing boosting is slightly more involved. We will allow results to be boosted based on either the object_id or the object_type. The way I came up with, given the schema I chose, requires iterating over the entire result set. The general plan, then, is to store the results in a new key and iterate over them. If an object matches either the type or id of a specified boost, multiply the score by that. Then, when we pull results they will be in sorted/boosted order.

Taken all together, the search function from redis-completion looks like this:

def search(self, phrase, limit=None, filters=None, mappers=None, boosts=None):
    cleaned = self.clean_phrase(phrase)
    if not cleaned:
        return []

    if len(cleaned) == 1 and not boosts:
        new_key = self.search_key(cleaned[0])
    else:
        new_key = self.get_cache_key(cleaned, boosts)
        if not self.client.exists(new_key):
            self.client.zinterstore(new_key, map(self.search_key, cleaned))
            self.client.expire(new_key, self.cache_timeout)

    if boosts:
        pipe = self.client.pipeline()
        for raw_id, score in self.client.zrange(new_key, 0, -1, withscores=True):
            orig_score = score
            for part in self.ksplit(raw_id):
                if part and part in boosts:
                    score *= 1 / boosts[part]
            if orig_score != score:
                pipe.zadd(new_key, raw_id, score)
        pipe.execute()

    ct = 0
    data = []

    for raw_id in self.client.zrange(new_key, 0, -1):
        raw_data = self.client.hget(self.data_key, raw_id)
        if not raw_data:
            continue

        if mappers:
            for m in mappers:
                raw_data = m(raw_data)

        if filters:
            passes = True
            for f in filters:
                if not f(raw_data):
                    passes = False
                    break

            if not passes:
                continue

        data.append(raw_data)
        ct += 1
        if limit and ct == limit:
            break

    return data

That's all there is to it!

Aside from a few helper functions for things like generating lists of keys, scoring keys, and encoding the object id/type, that's all there is to it. It is fast and you can save hits to the database server by storing rich data in Redis. I've packaged this code up in a small project called redis-completion. As promised, I've also got code that will work for a pure python implementation. By implementing a sorted set using a skip list, we can replace Redis with an in-memory python data structure.

The code for the pure python example can be found in this gist.

I hope you've enjoyed reading, please let me know if you have any suggestions for improvmeent, find any bugs, etc!

Comments (0)

Commenting has been closed.