November 01, 2010 21:07 / 0 comments / django haystack search solr

Users of djangosnippets.org may have noticed the addition of a few search-related features over the past several months. I'd like to highlight some of the additions that have been made and show how you can implement similar functionality on your sites. All of djangosnippets.org's search leans on Apache Solr, a powerful search engine built on top of Apache Lucene. Haystack is the search solution for Django apps - it provides a querying interface similar to Django's ORM, handles indexing your models for you, and supports advanced features like "more-like-this" and faceting.

Getting set up (angle brackets, anyone?)

I've actually written another post on setting up multi-core Solr on Ubuntu 10.04. I got a bit of flak for using tomcat6 as the server - you can definitely go with jetty instead. Jetty is bundled with Solr; check out the examples/README.txt to get started quickly. You might find the following links useful:

When setting up search with haystack, there are two important configuration files to be aware of:

schema.xml

The Solr schema is only superficially analogous to a database schema (if your database were just one big freaking table). It does a whole lot more than a database schema, allowing you to configure how individual fields are tokenized, filtered, stored, and searched. There is a high degree of configurability, so if your needs go beyond a basic "site search" I'd recommend Solr 1.4 Enterprise Search Server - I'm only 5 chapters in and it's already pretty much blown my mind. Luckily, haystack will generate this file automatically, allowing you to get up and running quickly.

solrconfig.xml

I have not gone too deep into this file, but it is where you can configure things like caching, more-like-this support, spell check, and highlighting. It also gives you a whole bunch of knobs for configuring the inner workings of the indexing and querying facilities.

a final word on getting solr running

The Seven Deadly Sins of Solr

I am still very much a n00b when it comes to Search and am probably doing more than a few things wrong. Any helpful suggestions would be appreciated!

(In fact, the search engine for djangosnippets is running on a 10-year-old Pentium III laptop. The three hours last week where search was down? I was rearranging my room.)

Site Search

The first search-related feature I'll discuss is the site search. The first step was getting haystack installed and creating a SearchIndex for the snippets. The SearchIndex usually mirrors the models.py to some extent, although if you plan on indexing more than a couple models you may want to pick some field-naming conventions to keep the number of different fields in your Solr index small.

from haystack.indexes import *
from haystack import site
from cab.models import Snippet

class SnippetIndex(SearchIndex):
    text = CharField(document=True, use_template=True)
    author = CharField(model_attr='author__username')
    title = CharField(model_attr='title')
    tags = CharField()
    tag_list = MultiValueField()
    language = CharField(model_attr='language__name')
    pub_date = DateTimeField(model_attr='pub_date')
    django_version = FloatField(model_attr='django_version')
    bookmark_count = IntegerField(model_attr='bookmark_count')
    rating_score = IntegerField(model_attr='rating_score')
    url = CharField(indexed=False)

    def prepare_tags(self, obj):
        return ' '.join([tag.name for tag in obj.tags.all()])

    def prepare_tag_list(self, obj):
        return [tag.name for tag in obj.tags.all()]

    def prepare_url(self, obj):
        return obj.get_absolute_url()

    def get_updated_field(self):
        return 'updated_date'

site.register(Snippet, SnippetIndex)

There's a lot of stuff in there, but the two most interesting bits are the first and last fields. The text field is the default search field and is generated by rendering the search/indexes/cab/snippet_text.txt template. Peeking at the schema.xml, this field is getting tokenized and filtered:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

...

<field name="text" type="text" indexed="true" stored="true" multiValued="false" />

This field is the heart of the index and is queried whenever a field is not explicitly specified. Check out Analyzers, tokenizers and filters if you're interested in reading up on what these various bits do.
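
To make the analyzer chain a little more concrete, here is a rough plain-Python approximation of three of the index-time steps from the fieldType above: the whitespace tokenizer, the stop-word filter, and the lowercase filter. It is only a sketch, not how Solr implements them. The STOPWORDS set is a made-up sample (not the contents of stopwords.txt), the analyze function is a hypothetical name, and the word-delimiter, stemming, and duplicate-removal filters are left out entirely.

```python
# Toy approximation of part of the index-time analyzer chain.
# Only three steps are modelled; WordDelimiterFilterFactory,
# EnglishPorterFilterFactory and RemoveDuplicatesTokenFilterFactory are omitted.

# Assumed sample stop words, NOT the real contents of Solr's stopwords.txt.
STOPWORDS = {"a", "an", "and", "the", "of", "to"}


def analyze(text, stopwords=STOPWORDS):
    # WhitespaceTokenizerFactory: split on whitespace only.
    tokens = text.split()
    # StopFilterFactory with ignoreCase="true": drop stop words in any case.
    tokens = [t for t in tokens if t.lower() not in stopwords]
    # LowerCaseFilterFactory: normalise case so "Template" matches "template".
    return [t.lower() for t in tokens]


print(analyze("The Template Tag and the template Filter"))
```

Because the query analyzer applies the same lowercasing and stop-word steps, a search for "Template" and a search for "template" end up looking for the same term.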

The last field is unique because, as you can see in the field definition, it is setting indexed=False. This yields the following line in the autogenerated schema.xml:

<field name="url" type="string" indexed="false" stored="true" multiValued="false" />

Because the field is not indexed it cannot be queried directly, but it will be returned as a part of the search results, effectively allowing me to save a database query when generating a link.

Indexing and Searching

Once the schema and search_indexes.py are in place, I can index all the snippets in the database by running django-admin.py rebuild_index. When I want to update the index, I run django-admin.py update_index --age=[age in hours]. To get closer to real-time results try the Real-Time SearchIndex that comes with haystack.
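
As far as I understand it, the --age option combines with get_updated_field so that only objects modified within the last N hours get re-indexed. The idea can be sketched in plain Python like this. It is an illustration of the cutoff logic only, not haystack's actual code: needs_reindex is a hypothetical helper and the records are made-up dictionaries standing in for snippets.

```python
from datetime import datetime, timedelta


def needs_reindex(records, age_hours, now=None):
    """Return the records whose 'updated_date' falls within the last age_hours."""
    now = now or datetime.now()
    cutoff = now - timedelta(hours=age_hours)
    return [r for r in records if r["updated_date"] >= cutoff]


# Made-up sample data; 'updated_date' mirrors the field named in get_updated_field().
now = datetime(2010, 11, 1, 21, 0)
records = [
    {"title": "old snippet", "updated_date": datetime(2010, 10, 1, 9, 0)},
    {"title": "fresh snippet", "updated_date": datetime(2010, 11, 1, 20, 0)},
]

recent = needs_reindex(records, age_hours=2, now=now)
print([r["title"] for r in recent])  # ['fresh snippet']
```

Running the update from cron with an age slightly larger than the cron interval means a late run is less likely to skip recently changed snippets.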

A basic search view can lean on haystack's default. Here is the line from my urlconf:

url(r'^search/$', 'haystack.views.basic_search', name='cab_search'),

Advanced Search

To get advanced search going, I subclassed SearchForm, added the fields I needed and then basically did a shitload of filtering.

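from django import forms
from django.contrib import admin
from haystack.forms import SearchForm

# Assumption: Language lives alongside Snippet in cab.models.
from cab.models import Language
# DJANGO_VERSIONS is assumed to be a choices tuple defined elsewhere in the project.

class AdvancedSearchForm(SearchForm):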
    language = forms.ModelChoiceField(queryset=Language.objects.all(), required=False)
    django_version = forms.MultipleChoiceField(choices=DJANGO_VERSIONS, required=False)
    minimum_pub_date = forms.DateTimeField(widget=admin.widgets.AdminDateWidget,
        required=False)
    minimum_bookmark_count = forms.IntegerField(required=False)
    minimum_rating_score = forms.IntegerField(required=False)

    def search(self):
        # First, store the SearchQuerySet received from other processing.
        sqs = super(AdvancedSearchForm, self).search()

        if self.cleaned_data['language']:
            sqs = sqs.filter(language=self.cleaned_data['language'].name)

        if self.cleaned_data['django_version']:
            sqs = sqs.filter(django_version__in=self.cleaned_data['django_version'])

        if self.cleaned_data['minimum_pub_date']:
            sqs = sqs.filter(pub_date__gte=self.cleaned_data['minimum_pub_date'])

        if self.cleaned_data['minimum_bookmark_count']:
            sqs = sqs.filter(bookmark_count__gte=self.cleaned_data['minimum_bookmark_count'])

        if self.cleaned_data['minimum_rating_score']:
            sqs = sqs.filter(rating_score__gte=self.cleaned_data['minimum_rating_score'])

        return sqs
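
The repeated if-blocks in search() all follow the same pattern: skip the filter when the user left the field empty. One possible way to condense the simple cases is a table-driven helper like the sketch below. The LOOKUPS mapping and build_filters function are hypothetical names for illustration, and the language field is left out because it filters on .name rather than on the raw value.

```python
# Hypothetical mapping from form field name to the index lookup it drives.
LOOKUPS = {
    "django_version": "django_version__in",
    "minimum_pub_date": "pub_date__gte",
    "minimum_bookmark_count": "bookmark_count__gte",
    "minimum_rating_score": "rating_score__gte",
}


def build_filters(cleaned_data, lookups=LOOKUPS):
    """Collect lookup -> value pairs, skipping fields the user left empty."""
    filters = {}
    for field, lookup in lookups.items():
        value = cleaned_data.get(field)
        if value:  # same truthiness test as the if-blocks in search()
            filters[lookup] = value
    return filters


filters = build_filters({
    "minimum_bookmark_count": 5,
    "minimum_rating_score": None,
    "django_version": [],
})
print(filters)  # {'bookmark_count__gte': 5}
```

The resulting dictionary could then be applied in one call with sqs.filter(**filters). Note that, just like the original if-blocks, a minimum of 0 counts as empty and is skipped.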

The relevant line in the urlconf looks like:

from haystack.views import SearchView, search_view_factory

from cab.forms import AdvancedSearchForm

url(r'^search/advanced/$', search_view_factory(
    view_class=SearchView,
    template='search/advanced_search.html',
    form_class=AdvancedSearchForm
), name='cab_search_advanced'),

More like this


Haystack provides support for more-like-this out of the box. I needed to add one line to my solrconfig.xml to enable MLT:

<requestHandler name="/mlt" class="solr.MoreLikeThisHandler" />

Make sure that line is present (or uncommented) and you're good to go. I wrote a short filter for use in the template:

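from django import template
from haystack.query import SearchQuerySet

register = template.Library()

@register.filter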
def more_like_this(snippet, limit=None):
    sqs = SearchQuerySet().more_like_this(snippet)
    if limit is not None:
        sqs = sqs[:limit]
    return sqs

Haystack ships with a templatetag that offers a good deal more options.

This definitely qualifies as low-hanging fruit once you've got the initial pieces in place and can really add a lot of value to your site. One of the problems I often have with djangosnippets is that I get a lot of old content that's been upvoted to hell but there's actually a newer, cooler version out there. MLT is pretty good at finding these newer snippets.

Autocomplete

Arguably, the feature I'm most excited about is Solr's ability to do autocompletion. Out of the box it's possible to do wildcard searches but this approach does not scale. It's better to use the NGram filter, which I've wrapped up as a custom fieldType in my schema.xml:

<fieldType name="ngram" class="solr.TextField" positionIncrementGap="100"
  stored="false" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.NGramFilterFactory" minGramSize="3"
      maxGramSize="15" />
  </analyzer>

  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
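
To see why this works for autocompletion, it helps to look at what an n-gram filter produces. The sketch below is a plain-Python imitation of the idea, not Solr's implementation: ngrams is a hypothetical helper that lowercases a token and emits every substring between 3 and 15 characters long, matching the minGramSize and maxGramSize values above.

```python
def ngrams(token, min_size=3, max_size=15):
    """Return every substring of token whose length is between min_size and max_size."""
    token = token.lower()  # stands in for the LowerCaseFilterFactory step
    grams = []
    for size in range(min_size, min(max_size, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


# At index time every gram becomes a searchable term for the document.
index_terms = set(ngrams("Haystack"))

# The query side is NOT n-grammed, so the partial phrase is looked up as-is.
print("hay" in index_terms)    # True
print("stack" in index_terms)  # True
print("ha" in index_terms)     # False: shorter than the minimum gram size
```

Since no gram is shorter than three characters, a one- or two-character query has nothing to match against, so there is little point in sending such a query to Solr at all.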

Then, I declare a title_ngram field and copy in the value of the title field:

<field name="title_ngram" type="ngram" />

...

<copyField source="title" dest="title_ngram" />

To get the results back out, it's just a matter of querying the title_ngram field with the user's partial phrase:

import json

from django.http import HttpResponse
from haystack.query import SearchQuerySet

def autocomplete(request):
    q = request.GET.get('q') or ''
    results = []
    if len(q) > 2:
        sqs = SearchQuerySet()
        result_set = sqs.filter(title_ngram=q)[:10]
        for obj in result_set:
            results.append({
                'title': obj.title,
                'author': obj.author,
                'url': obj.url
            })
    return HttpResponse(json.dumps(results), mimetype='application/json')

Conclusion

Hope you found this post informative! There are a ton of interesting things that Solr can do, and Haystack provides a nice wrapper around the most common features. As always, any comments, feedback, suggestions, errata, etc. are appreciated.
