In this post I'll present how I built a (reasonably) powerful autocomplete engine with Redis and python. For those who are not familiar with Redis, it is a fast, in-memory, single-threaded database that is capable of storing structured data (lists, hashes, sets, sorted sets). I chose Redis for this particular project because its sorted set data type, which is a good fit for autocomplete. The engine I'll describe relies heavily on Redis' sorted sets and its set operations, but can easily be translated to a pure-python solution (links at bottom of post).
To be actually useful, the engine needs to support a given set of features and guarantee a reasonable query complexity. Redis' sorted set operations are generally on the order of O(log(N)) (the underlying implementation uses skip lists), so we can satisfy the latter requirement. For the first requirement, here is my list -- the actual features themselves were suggested by the Quora typeahead search puzzle:
In addition to these requirements, I've also added:
The schema I used stores autocomplete information in a series of sorted sets. Rich data is stored in a separate hash-table. An excellent description of the schema can be found on Pat Shaughnessy's blog, and this post is mostly building on his.
The schema looks like this:
To add an item to the index:
Digging a little deeper into steps 1-3, let's assume we are storing a document titled "python code" with a unique id of "1337". We will execute the following Redis commands:
HSET title.key 1337 "python code"(step 1)
HSET data.key 1337 <item data>(step 2)
ZADD search.key.py <score> 1337
ZADD search.key.pyt <score> 1337
ZADD search.key.pyth <score> 1337
ZADD search.key.pytho <score> 1337
ZADD search.key.python <score> 1337
ZADD search.key.co <score> 1337
ZADD search.key.cod <score> 1337
ZADD search.key.code <score> 1337
Here is how I have implemented it in redis-completion. For simplicity, the title and data are optional parameters, though in practice I have always specified them. Additionally an object type may be specified, which comes into play later when I'll talk about boosting.
def store(self, obj_id, title=None, data=None, obj_type=None): pipe = self.client.pipeline() if title is None: title = obj_id if data is None: data = title title_score = self.score_key(self.create_key(title)) obj_id = self.kcombine(obj_id, obj_type or '') pipe.hset(self.data_key, obj_id, data) pipe.hset(self.title_key, obj_id, title) for word in self.clean_phrase(title): for partial_key in self.autocomplete_keys(word): pipe.zadd(self.search_key(partial_key), obj_id, title_score) pipe.execute()
To reiterate, we're storing the rich data (e.g. the full title and some associated data) in hashes keyed by a unique ID for the given object. Then, we're iterating through all the partial strings for the full title. Each of these partial strings will be used as a key for a sorted set. The sorted set stores the unique ID and a score for the title that will ensure results pulled from the set are in alphabetical order.
One interesting part here is scoring the item to ensure lexical ordering. The way I did that was to convert the title from a string to a base-27 number (1 for each letter, 1 for whitespace), which is what "score_key" does.
Searching for a single word is very easy, since the list of IDs is stored in a sorted set where the key is each partial phrase. The workflow looks like this:
Where things get interesting is when we are searching for multiple partial phrases,
for example "pyt" and "co". This is where our schema shines. We can use a special
ZINTERSTORE which will store the intersection of multiple sorted
sets in a new sorted set. From the redis docs, the complexity of this operation is
on the order of O(N*log(N)):
Time complexity: O(NK)+O(Mlog(M)) worst case with N being the smallest input sorted set, K being the number of input sorted sets and M being the number of elements in the resulting sorted set.
We will take each phrase, for example "pyt" and "co", and store the intersection of those two sets in a new sorted set. Redis handles all of this automatically. Additionally, we can set an expiry on this new sorted set so that we ensure our search index doesn't grow out of control, but popular searches will be automatically cached. Pretty sweet.
Below is a simplified version of the search code. It takes an incoming phrase, which may be multiple words, and splits it into groups of lower-cased words. If only a single word is searched for, we can simply look at that one key, but if multiple words are searched we will store the intersection of those sets in a new key and search that:
def search(self, phrase): # split incoming phrase into a list of lower-cased strings cleaned = self.clean_phrase(phrase) if not cleaned: # empty search, return no results return  if len(cleaned) == 1: new_key = self.search_key(cleaned) else: new_key = self.get_cache_key(cleaned) if not self.client.exists(new_key): self.client.zinterstore(new_key, map(self.search_key, cleaned)) self.client.expire(new_key, self.cache_timeout) for raw_id in self.client.zrange(new_key, 0, -1): yield self.client.hget(self.data_key, raw_id)
Things get interesting when we add a few capabilities on top of this basic search function. We'll group these into 3 categories:
All of these things can be implemented in the search function. Mappers and filters are easy...we simply take the data retrieved from the hash and transform it using the various mapper functions. After this, we apply the filters in order and in the event the filter returns False we discard the result. The code might look something like this:
for raw_id in self.client.zrange(new_key, 0, -1): raw_data = self.client.hget(self.data_key, raw_id) if mappers: for m in mappers: raw_data = m(raw_data) if filters: passes = True for f in filters: if not f(raw_data): passes = False break if not passes: continue yield raw_data
Implementing boosting is slightly more involved. We will allow results to be boosted based on either the object_id or the object_type. The way I came up with, given the schema I chose, requires iterating over the entire result set. The general plan, then, is to store the results in a new key and iterate over them. If an object matches either the type or id of a specified boost, multiply the score by that. Then, when we pull results they will be in sorted/boosted order.
Taken all together, the search function from redis-completion looks like this:
def search(self, phrase, limit=None, filters=None, mappers=None, boosts=None): cleaned = self.clean_phrase(phrase) if not cleaned: return  if len(cleaned) == 1 and not boosts: new_key = self.search_key(cleaned) else: new_key = self.get_cache_key(cleaned, boosts) if not self.client.exists(new_key): self.client.zinterstore(new_key, map(self.search_key, cleaned)) self.client.expire(new_key, self.cache_timeout) if boosts: pipe = self.client.pipeline() for raw_id, score in self.client.zrange(new_key, 0, -1, withscores=True): orig_score = score for part in self.ksplit(raw_id): if part and part in boosts: score *= 1 / boosts[part] if orig_score != score: pipe.zadd(new_key, raw_id, score) pipe.execute() ct = 0 data =  for raw_id in self.client.zrange(new_key, 0, -1): raw_data = self.client.hget(self.data_key, raw_id) if not raw_data: continue if mappers: for m in mappers: raw_data = m(raw_data) if filters: passes = True for f in filters: if not f(raw_data): passes = False break if not passes: continue data.append(raw_data) ct += 1 if limit and ct == limit: break return data
Aside from a few helper functions for things like generating lists of keys, scoring keys, and encoding the object id/type, that's all there is to it. It is fast and you can save hits to the database server by storing rich data in Redis. I've packaged this code up in a small project called redis-completion. As promised, I've also got code that will work for a pure python implementation. By implementing a sorted set using a skip list, we can replace Redis with an in-memory python data structure.
The code for the pure python example can be found in this gist.
I hope you've enjoyed reading, please let me know if you have any suggestions for improvmeent, find any bugs, etc!
Commenting has been closed, but please feel free to contact me