Search on djangosnippets

Users of djangosnippets may have noticed the addition of a few search-related features over the past several months. I'd like to highlight some of these additions and show how you can implement similar functionality on your own sites. All of djangosnippets' search leans on Haystack and Apache Solr. Solr is a powerful search engine built on top of Apache Lucene. Haystack is the search solution for Django apps: it provides a querying interface similar to Django's ORM, handles indexing your models for you, and supports advanced features like more-like-this and faceting.

Getting set up

The first step is to get Solr running. Jetty, a Java app server, is bundled with Solr, and examples/README.txt contains instructions for getting up and running.

When setting up search with haystack, there are two important configuration files to be aware of:


schema.xml

The Solr schema is only superficially analogous to a database schema (if your database were just one big table). It does more than a database schema, allowing you to configure how individual fields are tokenized, filtered, stored, and searched. There is a high degree of configurability, so if your needs go beyond a basic site search I'd recommend Solr 1.4 Enterprise Search Server. I'm only 5 chapters in and it's already pretty much blown my mind. Luckily, haystack will generate this file automatically, allowing you to get up and running quickly.


solrconfig.xml

I have not gone too deep into this file, but it is where you can configure things like caching, more-like-this support, spell check, and highlighting. It also gives you a whole bunch of knobs for configuring the inner workings of the indexing and querying facilities.

A warning

The Seven Deadly Sins of Solr

I am still learning and am probably doing more than a few things clumsily if not altogether wrongly. Any helpful suggestions would be appreciated!

(In fact, the search engine for djangosnippets is running on a 10-year-old Pentium III laptop. The three hours last week where search was down? I was rearranging my room.)

The first search-related feature I'll discuss is the site search. The first step was getting haystack installed and creating a SearchIndex for the snippets. The SearchIndex usually mirrors the model to some extent, although if you plan on indexing more than a couple of models you may want to pick some field-naming conventions to keep the number of different fields in your Solr index small.

from haystack.indexes import *
from haystack import site
from cab.models import Snippet

class SnippetIndex(SearchIndex):
    text = CharField(document=True, use_template=True)
    author = CharField(model_attr='author__username')
    title = CharField(model_attr='title')
    tags = CharField()
    tag_list = MultiValueField()
    language = CharField(model_attr='language__name')
    pub_date = DateTimeField(model_attr='pub_date')
    django_version = FloatField(model_attr='django_version')
    bookmark_count = IntegerField(model_attr='bookmark_count')
    rating_score = IntegerField(model_attr='rating_score')
    url = CharField(indexed=False)

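    def prepare_tags(self, obj):
        # Space-separated tag names for full-text matching.
        # Assumes each tag exposes a `name` attribute.
        return ' '.join([tag.name for tag in obj.tags.all()])

    def prepare_tag_list(self, obj):
        # One value per tag for the multi-valued field (same `name` assumption).
        return [tag.name for tag in obj.tags.all()]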

    def prepare_url(self, obj):
        return obj.get_absolute_url()

    def get_updated_field(self):
        return 'updated_date'

site.register(Snippet, SnippetIndex)

There's a lot of stuff in there, but the two most interesting bits are the first and last fields. The text field is the default search field and is generated by rendering the search/indexes/cab/snippet_text.txt template. Peeking at the schema.xml, this field is getting tokenized and filtered:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>


<field name="text" type="text" indexed="true" stored="true" multiValued="false" />

This field is the heart of the index and is queried whenever a field is not explicitly specified. Check out Analyzers, tokenizers and filters if you're interested in reading up on what these various bits do.
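To get a feel for what the index-time chain above does to a piece of text, here is a rough pure-Python approximation. It is only a sketch: the stopword list is a made-up stand-in for stopwords.txt, the word-delimiter step only splits on non-alphanumeric characters (the real filter also splits on case changes and, with catenateWords="1", emits joined forms), and stemming and duplicate removal are left out entirely.

```python
import re

# Hypothetical stand-in for stopwords.txt; the real list lives in Solr's conf directory.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def analyze(text):
    """Very rough approximation of the index-time chain for the "text" type."""
    # WhitespaceTokenizerFactory: split on whitespace only.
    tokens = text.split()
    # StopFilterFactory with ignoreCase="true": drop stopwords regardless of case.
    tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    parts = []
    for token in tokens:
        # WordDelimiterFilterFactory, simplified: split on non-alphanumerics only.
        parts.extend(p for p in re.split(r"[^A-Za-z0-9]+", token) if p)
    # LowerCaseFilterFactory.
    return [p.lower() for p in parts]


print(analyze("The Wi-Fi template_tag for Django"))
```

The point is mostly that a query for "template" can match a snippet whose text only contains "template_tag", because both sides go through a similar chain before being compared.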

The last field is unique because, as you can see in the field definition, it is setting indexed=False. This yields the following line in the autogenerated schema.xml:

<field name="url" type="string" indexed="false" stored="true" multiValued="false" />

Because the field is not indexed it cannot be queried directly, but it will be returned as a part of the search results, effectively allowing me to save a database query when generating a link.

Indexing and Searching

Once the schema and configuration are in place, I can index all the snippets in the database by running the rebuild_index management command. When I want to update the index, I run update_index --age=[age in hours]. To get closer to real-time results, try the Real-Time SearchIndex that comes with haystack.

A basic search view can lean on haystack's default. Here is the line from my urlconf:

url(r'^search/$', 'haystack.views.basic_search', name='cab_search'),

To get advanced search going, I subclassed SearchForm, added the fields I needed and then basically did a shitload of filtering.

class AdvancedSearchForm(SearchForm):
    language = forms.ModelChoiceField(queryset=Language.objects.all(), required=False)
    django_version = forms.MultipleChoiceField(choices=DJANGO_VERSIONS, required=False)
    minimum_pub_date = forms.DateTimeField(widget=admin.widgets.AdminDateWidget, required=False)
    minimum_bookmark_count = forms.IntegerField(required=False)
    minimum_rating_score = forms.IntegerField(required=False)

    def search(self):
        # First, store the SearchQuerySet received from other processing.
        sqs = super(AdvancedSearchForm, self).search()

        if self.cleaned_data['language']:
            sqs = sqs.filter(language=self.cleaned_data['language'].name)

        if self.cleaned_data['django_version']:
            sqs = sqs.filter(django_version__in=self.cleaned_data['django_version'])

        if self.cleaned_data['minimum_pub_date']:
            sqs = sqs.filter(pub_date__gte=self.cleaned_data['minimum_pub_date'])

        if self.cleaned_data['minimum_bookmark_count']:
            sqs = sqs.filter(bookmark_count__gte=self.cleaned_data['minimum_bookmark_count'])

        if self.cleaned_data['minimum_rating_score']:
            sqs = sqs.filter(rating_score__gte=self.cleaned_data['minimum_rating_score'])

        return sqs
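The repeated "if a value was supplied, filter on it" blocks could also be driven by a small lookup table. The sketch below is one way to do that in plain Python; the field and lookup names are taken from the form above, but FILTER_SPEC and build_filters are hypothetical names of my own, not part of haystack.

```python
# Each entry: (form field name, index lookup, function converting the cleaned value).
# The table and the helper below are hypothetical, only the names inside come from the form.
FILTER_SPEC = [
    ("language", "language", lambda lang: lang.name),
    ("django_version", "django_version__in", lambda value: value),
    ("minimum_pub_date", "pub_date__gte", lambda value: value),
    ("minimum_bookmark_count", "bookmark_count__gte", lambda value: value),
    ("minimum_rating_score", "rating_score__gte", lambda value: value),
]


def build_filters(cleaned_data):
    """Return lookup kwargs for every field the user actually filled in."""
    filters = {}
    for field, lookup, convert in FILTER_SPEC:
        value = cleaned_data.get(field)
        # Same truthiness check as the form: None, empty lists, and 0 are skipped.
        if value:
            filters[lookup] = convert(value)
    return filters
```

Inside search() this would be used as sqs.filter(**build_filters(self.cleaned_data)). That should narrow the results the same way as the chained calls, assuming haystack is left on its default AND operator.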

The relevant line in the urlconf looks like:

    from haystack.views import SearchView, search_view_factory

    from cab.forms import AdvancedSearchForm

    url(r'^search/advanced/$', search_view_factory(
        view_class=SearchView,
        form_class=AdvancedSearchForm,
    ), name='cab_search_advanced'),

More like this


Haystack provides support for more-like-this out of the box. I needed to add one line to my solrconfig.xml to enable MLT:

<requestHandler name="/mlt" class="solr.MoreLikeThisHandler" />

Make sure that line is present (or uncommented) and you're good to go. I wrote a short filter for use in the template:

def more_like_this(snippet, limit=None):
    sqs = SearchQuerySet().more_like_this(snippet)
    if limit is not None:
        sqs = sqs[:limit]
    return sqs

Haystack ships with a templatetag that offers a good deal more options.

This definitely qualifies as low-hanging fruit once you've got the initial pieces in place and can really add a lot of value to your site. One of the problems I often have with djangosnippets is that I get a lot of old content that's been upvoted to hell but there's actually a newer, cooler version out there. MLT is pretty good at finding these newer snippets.


Arguably, the feature I'm most excited about is Solr's ability to do autocompletion. Out of the box it's possible to do wildcard searches but this approach does not scale. It's better to use the NGram filter, which I've wrapped up as a custom fieldType in my schema.xml:

<fieldType name="ngram" class="solr.TextField" positionIncrementGap="100"
  stored="false" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.NGramFilterFactory" minGramSize="3"
      maxGramSize="15" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

Then, I declare a title_ngram field and copy in the value of the title field:

<field name="title_ngram" type="ngram" />


<copyField source="title" dest="title_ngram" />
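If the n-gram idea is new to you, the following pure-Python sketch shows roughly what the index-time NGram filter produces for a single title token, using the same minimum and maximum sizes as the fieldType above. The ngrams function is my own illustration, not Solr's implementation, and the order in which Solr emits grams may differ.

```python
def ngrams(token, min_size=3, max_size=15):
    """All substrings of token with a length between min_size and max_size."""
    token = token.lower()
    grams = []
    for size in range(min_size, min(max_size, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


# Every gram is stored in the index, so a partial word becomes an exact term match.
indexed = set(ngrams("Pagination"))
print("pagi" in indexed)   # a partial word the user might have typed so far
print("pg" in indexed)     # shorter than the minimum gram size, so never indexed
```

Because the query analyzer has no NGram filter, whatever the user has typed is looked up as a whole term against these grams. That is also why it makes sense to only query once at least three characters have been entered, and why a partial word longer than 15 characters would not match anything.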

To get the results back out, it's just a matter of querying the title_ngram field with the user's partial phrase:

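import json

from django.http import HttpResponse
from haystack.query import SearchQuerySet

def autocomplete(request):
    q = request.GET.get('q') or ''
    results = []
    # Only query once there are enough characters to match the smallest n-gram.
    if len(q) > 2:
        sqs = SearchQuerySet()
        result_set = sqs.filter(title_ngram=q)[:10]
        for obj in result_set:
            # title and url are stored fields, so no database hit is needed.
            results.append({
                'title': obj.title,
                'url': obj.url
            })
    return HttpResponse(json.dumps(results), mimetype='application/json')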


Hope you found this post informative! There are a ton of interesting things that Solr can do and Haystack provides a nice wrapper around the most common features. As always, any comments, feedback, suggestions, errata, etc. are appreciated.
