Meet Scout, a Search Server Powered by SQLite

March 28, 2015 11:03 / peewee python scout search sqlite / 5 comments

In my continuing adventures with SQLite, I had the idea of writing a RESTful search server utilizing SQLite's full-text search extension. You might think of it as a poor man's ElasticSearch.

So what is this project? Well, the idea I had was that instead of building out separate search implementations for my various projects, I would build a single lightweight search service I could use everywhere. I really like SQLite (and have previously blogged about using SQLite's full-text search with Python), and the full-text search extension is quite good, so it didn't require much imagination to take the next leap and expose it as a web-service.

Scout is the resulting project, and I hope you find it interesting! Scout is written in Python and uses the Flask framework to expose the web-service. Scout has a few simple concepts:

Indexes
Documents
Metadata

Indexes and documents are related to each other in a many-to-many configuration, so a particular document can belong to multiple indexes. An index is simply a logical grouping of documents, for instance blog posts, wiki pages or recipes. A document is just a blob of text content you want to be able to search. For a blog entry, the content might be the paragraphs of text; for a recipe, it might be the title and ingredients. Documents can also have arbitrary metadata stored as key/value pairs. As a bonus, you can even perform simple filter operations on the metadata in addition to the full-text search over the content!
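To make the relationships concrete, here is a minimal sketch of how such a data model could be laid out in plain SQLite tables. The table and column names (idx, document, index_document, metadata, meta_key, meta_value) are hypothetical and chosen for illustration; Scout's actual schema may look different.

```python
import sqlite3

# Hypothetical schema for illustration only; not Scout's real tables.
SCHEMA = """
CREATE TABLE idx (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE document (id INTEGER PRIMARY KEY, content TEXT NOT NULL);
-- Junction table: one row per (index, document) membership.
CREATE TABLE index_document (
    index_id INTEGER NOT NULL REFERENCES idx(id),
    document_id INTEGER NOT NULL REFERENCES document(id),
    PRIMARY KEY (index_id, document_id)
);
-- Arbitrary key/value pairs attached to a document.
CREATE TABLE metadata (
    document_id INTEGER NOT NULL REFERENCES document(id),
    meta_key TEXT NOT NULL,
    meta_value TEXT NOT NULL
);
"""


def build_example():
    """Create an in-memory database holding one document in two indexes."""
    conn = sqlite3.connect(':memory:')
    conn.executescript(SCHEMA)
    conn.executemany('INSERT INTO idx (name) VALUES (?)',
                     [('blog',), ('recipes',)])
    conn.execute('INSERT INTO document (content) VALUES (?)',
                 ('Pancakes: flour, eggs, milk',))
    # The same document (id 1) is a member of both indexes (ids 1 and 2).
    conn.executemany('INSERT INTO index_document VALUES (?, ?)',
                     [(1, 1), (2, 1)])
    conn.execute('INSERT INTO metadata VALUES (?, ?, ?)',
                 (1, 'type', 'recipe'))
    return conn


def indexes_for(conn, document_id):
    """Return the names of every index that contains the given document."""
    rows = conn.execute(
        'SELECT idx.name FROM idx '
        'JOIN index_document ON index_document.index_id = idx.id '
        'WHERE index_document.document_id = ? ORDER BY idx.name',
        (document_id,))
    return [name for (name,) in rows]
```

Calling indexes_for(build_example(), 1) lists both blog and recipes, which is the many-to-many relationship in action.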

Checking out Scout

What follows is a brief introduction to Scout and a tour of the features. If you want to follow along, you can install scout using pip or manually via GitHub. If you install from pip, the dependencies will also be installed automatically:

$ pip install scout
... lots of output ...
Successfully installed scout flask peewee Werkzeug Jinja2 itsdangerous markupsafe
Cleaning up...

The Scout server runs as its own process, so I'll start it up in one terminal, specifying a new database file to use as the search index:

$ scout search_index.db
 * Running on http://127.0.0.1:8000/ (Press CTRL+C to quit)

If I request this URL or pull it up in a browser I get a response indicating that no indexes exist yet:

$ curl localhost:8000/
{
  "indexes": []
}

Scout client

To make things easier for Python developers, Scout also comes with a lightweight client. I'll open up a Python interpreter, import the client, and we'll go over how to index (the verb, not the noun) content and perform searches.

>>> from scout_client import Scout
>>> client = Scout('http://localhost:8000/')

If you look over the methods available on the client, you can get a feel for the type of operations Scout supports:

create_index
delete_index
rename_index
store_document
update_document
delete_document
get_documents
search

Storing some documents

To get started we first need to create an index, which we can do by calling create_index() and passing in a name:

>>> client.create_index('thoughts')
{'documents': [],
 'id': 1,
 'name': 'thoughts',
 'page': 1,
 'pages': 0}

We get a nice dictionary confirming our index was created and telling us that there are currently no documents stored there. Let's take care of that by storing some thoughts.

I've been thinking about UFOs a lot recently, and I also like to think about my cat, Huey.

The store_document method accepts the following parameters:

content, the content we wish to store.
indexes, the name or names of the index(es) to add this document to.
metadata (optional), key/value pairs.

When we store a new document, we'll get a nice dictionary back indicating what was stored and giving us the id of the new Document:

>>> client.store_document(
...     ('The Rendlesham forest incident is one of the '
...      'most interesting UFO accounts.'),
...     ['thoughts'],
...     type='ufo')

{u'content': u'The Rendlesham forest incident is one of the most interesting UFO accounts.',
 u'id': 1,
 u'indexes': [u'thoughts'],
 u'metadata': {u'type': u'ufo'}}

Let's store a few more thoughts. I've added the following to my search index:

"Huey is not very interested in UFOs." (type='huey')
"Sometimes I wonder if huey is an alien." (type='huey')
"The Chicago O'Hare UFO incident is also intriguing." (type='ufo')
"The evidence points to UFOs being a physical phenomenon." (type='ufo')

Now that we have five documents in the index, let's perform some searches on the content.

Searching for UFOs

Let's see what happens when we search for all documents containing the word UFO:

>>> client.search('thoughts', 'ufo')

{u'documents': [
  {u'content': u'The Rendlesham forest incident is one of the most interesting UFO accounts.',
   u'id': 1,
   u'indexes': [u'thoughts'],
   u'metadata': {u'type': u'ufo'},
   u'score': 0.25},
  {u'content': u'Huey is not very interested in UFOs.',
   u'id': 2,
   u'indexes': [u'thoughts'],
   u'metadata': {u'type': u'huey'},
   u'score': 0.25},
  {u'content': u"The Chicago O'Hare UFO incident is also intriguing.",
   u'id': 4,
   u'indexes': [u'thoughts'],
   u'metadata': {u'type': u'ufo'},
   u'score': 0.25},
  {u'content': u'The evidence points to UFOs being a physical phenomenon.',
   u'id': 5,
   u'indexes': [u'thoughts'],
   u'metadata': {u'type': u'ufo'},
   u'score': 0.25}
 ],
 u'page': 1,
 u'pages': 1}

Scout returns a paginated list of matching documents (50 results per page, by default). Each search result contains the document's content, id, index(es), metadata, and a score field ranking the quality of the match.
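The page and pages values follow the usual pagination arithmetic. The sketch below is my assumption of how they could be computed, not code taken from Scout; the function names are made up for illustration. It is at least consistent with the responses shown earlier (0 pages for an empty index, 1 page for four results).

```python
import math


def page_count(total_results, per_page=50):
    """Number of pages needed to show total_results items."""
    return int(math.ceil(total_results / float(per_page)))


def page_slice(results, page, per_page=50):
    """Return the items that fall on a 1-based page number."""
    start = (page - 1) * per_page
    return results[start:start + per_page]
```

With 120 results and the default page size, page_count gives 3 and the third page holds the last 20 items.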

It is also possible to perform additional filtering based on metadata values. In the following example we will again query for ufo, but this time we'll also restrict the results to documents whose type='ufo':

>>> client.search('thoughts', 'ufo', type='ufo')['documents']

[{u'content': u'The Rendlesham forest incident is one of the most interesting UFO accounts.',
  u'id': 1,
  u'indexes': [u'thoughts'],
  u'metadata': {u'type': u'ufo'},
  u'score': 0.25},
 {u'content': u"The Chicago O'Hare UFO incident is also intriguing.",
  u'id': 4,
  u'indexes': [u'thoughts'],
  u'metadata': {u'type': u'ufo'},
  u'score': 0.25},
 {u'content': u'The evidence points to UFOs being a physical phenomenon.',
  u'id': 5,
  u'indexes': [u'thoughts'],
  u'metadata': {u'type': u'ufo'},
  u'score': 0.25}]
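For the curious, here is a rough sketch of how a full-text match can be combined with a metadata filter directly in SQLite, using Python's built-in sqlite3 module. The docs and meta tables and the search() helper are hypothetical stand-ins rather than Scout's real implementation, and the sketch assumes your SQLite build includes the FTS4 extension.

```python
import sqlite3

THOUGHTS = [
    ('The Rendlesham forest incident is one of the most interesting '
     'UFO accounts.', 'ufo'),
    ('Huey is not very interested in UFOs.', 'huey'),
    ('Sometimes I wonder if huey is an alien.', 'huey'),
    ("The Chicago O'Hare UFO incident is also intriguing.", 'ufo'),
    ('The evidence points to UFOs being a physical phenomenon.', 'ufo'),
]


def build_db():
    """Create an in-memory FTS4 table plus a key/value metadata table."""
    conn = sqlite3.connect(':memory:')
    conn.execute(
        'CREATE VIRTUAL TABLE docs USING fts4(content, tokenize=porter)')
    conn.execute(
        'CREATE TABLE meta (docid INTEGER, meta_key TEXT, meta_value TEXT)')
    for docid, (content, doc_type) in enumerate(THOUGHTS, 1):
        conn.execute('INSERT INTO docs (docid, content) VALUES (?, ?)',
                     (docid, content))
        conn.execute('INSERT INTO meta VALUES (?, ?, ?)',
                     (docid, 'type', doc_type))
    return conn


def search(conn, query, **filters):
    """Return ids of documents matching the query and every metadata filter."""
    sql = 'SELECT docid FROM docs WHERE docs MATCH ?'
    params = [query]
    # Each filter narrows the result to documents having that key/value pair.
    for key, value in sorted(filters.items()):
        sql += (' AND docid IN (SELECT docid FROM meta'
                ' WHERE meta_key = ? AND meta_value = ?)')
        params.extend([key, value])
    sql += ' ORDER BY docid'
    return [docid for (docid,) in conn.execute(sql, params)]
```

Here search(conn, 'ufo') finds documents 1, 2, 4 and 5, while search(conn, 'ufo', type='ufo') drops the Huey document and leaves 1, 4 and 5, mirroring the client output.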

Stemming

Scout configures the search index to use the Porter stemming algorithm by default. This means that words are reduced to their root form, so even though we indexed the words interesting and interested, look what happens when we search for interest:

>>> results = client.search('thoughts', 'interest')
>>> print results['documents']

[{u'content': u'The Rendlesham forest incident is one of the most interesting UFO accounts.',
  u'id': 1,
  u'indexes': [u'thoughts'],
  u'metadata': {u'type': u'ufo'},
  u'score': 0.5},
 {u'content': u'Huey is not very interested in UFOs.',
  u'id': 2,
  u'indexes': [u'thoughts'],
  u'metadata': {u'type': u'huey'},
  u'score': 0.5}]
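You can see the effect of the stemmer without Scout at all. The following sketch, which assumes your SQLite build includes FTS4, indexes the same two sentences twice: once with the porter tokenizer and once with the default simple tokenizer. The count_matches() helper is just something I made up for the comparison.

```python
import sqlite3

SENTENCES = [
    'The Rendlesham forest incident is one of the most interesting '
    'UFO accounts.',
    'Huey is not very interested in UFOs.',
]


def count_matches(tokenizer, query):
    """Index SENTENCES with the given FTS4 tokenizer and count the matches."""
    conn = sqlite3.connect(':memory:')
    # tokenizer is interpolated directly; only pass trusted names such as
    # 'porter' or 'simple'.
    conn.execute(
        'CREATE VIRTUAL TABLE docs USING fts4(content, tokenize=%s)'
        % tokenizer)
    conn.executemany('INSERT INTO docs (content) VALUES (?)',
                     [(sentence,) for sentence in SENTENCES])
    (count,) = conn.execute(
        'SELECT COUNT(*) FROM docs WHERE docs MATCH ?', (query,)).fetchone()
    conn.close()
    return count
```

With the porter tokenizer, count_matches('porter', 'interest') finds both sentences. With the simple tokenizer, count_matches('simple', 'interest') finds none, because neither sentence contains the exact token interest.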

Scoring

Note that the score is 0.5 for both documents. One of the interesting limitations of the FTS extension is that it does not provide an algorithm for ranking by relevance. Happily, SQLite allows us to define our own functions in Python, so Scout comes with two ranking algorithms: simple (described here) and bm25.

By default Scout will use the simple ranking algorithm, but you can specify the bm25 algorithm, which gives slightly different results:

>>> results = client.search('thoughts', 'interest', ranking='bm25')
>>> for document in results['documents']:
...     print document['content'][:20], document['score']
...

Huey is not very int 0.370119460283
The Rendlesham fores 0.296095568227
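To make the idea of a Python-defined ranking function concrete, here is a minimal sketch in the spirit of the simple algorithm: for each search phrase it adds up the ratio of hits in the current row to hits across all rows. It reads the blob returned by FTS4's matchinfo() function in its default format. The names simple_rank, build_db and search are mine, and this is an illustration under those assumptions rather than Scout's exact code. It also assumes your SQLite build includes FTS4.

```python
import sqlite3
import struct

THOUGHTS = [
    'The Rendlesham forest incident is one of the most interesting '
    'UFO accounts.',
    'Huey is not very interested in UFOs.',
    'Sometimes I wonder if huey is an alien.',
    "The Chicago O'Hare UFO incident is also intriguing.",
    'The evidence points to UFOs being a physical phenomenon.',
]


def simple_rank(raw_match_info):
    """Score a row from the default ('pcx') matchinfo() blob."""
    # The blob is an array of native-endian unsigned 32-bit integers.
    count = len(raw_match_info) // 4
    info = struct.unpack('@%dI' % count, raw_match_info)
    num_phrases, num_columns = info[0], info[1]
    score = 0.0
    for phrase in range(num_phrases):
        for column in range(num_columns):
            # Three integers are stored per (phrase, column) pair.
            base = 2 + 3 * (phrase * num_columns + column)
            hits_this_row, hits_all_rows = info[base], info[base + 1]
            if hits_this_row > 0:
                score += float(hits_this_row) / hits_all_rows
    return score


def build_db():
    """Create an in-memory index and register the ranking function."""
    conn = sqlite3.connect(':memory:')
    conn.create_function('simple_rank', 1, simple_rank)
    conn.execute(
        'CREATE VIRTUAL TABLE docs USING fts4(content, tokenize=porter)')
    conn.executemany('INSERT INTO docs (content) VALUES (?)',
                     [(thought,) for thought in THOUGHTS])
    return conn


def search(conn, query):
    """Return (docid, score) pairs for matching rows, best score first."""
    rows = conn.execute(
        'SELECT docid, simple_rank(matchinfo(docs)) AS score FROM docs '
        'WHERE docs MATCH ? ORDER BY score DESC, docid', (query,))
    return rows.fetchall()
```

Run against the five thoughts, search(conn, 'ufo') gives each of the four matches a score of 0.25, and search(conn, 'interest') gives both matches 0.5, the same numbers the client returned earlier.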

SQLite search queries

SQLite's full-text search engine supports an impressive variety of query types, which can be used when querying Scout:

Prefix searches: doc* would match both document and doctor.
First token search (FTS4 only): ^peewee would match documents that begin with the token peewee.
Quoted phrases: "sql* data*" would match both sqlite database and sql datatype.
NEAR queries: sqlite NEAR/5 search would match documents where the tokens sqlite and search are within 5 words of each other.
Set operations using AND, OR and NOT: huey OR ufos would match documents containing either huey or ufos.
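If you want to experiment with these query types outside of Scout, the sketch below runs a few of them against a throwaway FTS4 table using Python's sqlite3 module. The sample sentences and the match() helper are invented for the demo, and it assumes your SQLite build includes FTS4. I've left out explicit AND and NOT because their availability depends on how SQLite was compiled.

```python
import sqlite3

SAMPLES = [
    'sqlite makes full-text search easy',     # docid 1
    'a document about peewee',                # docid 2
    'the doctor is in',                       # docid 3
    'sqlite database files are portable',     # docid 4
    'the sql datatype reference',             # docid 5
    'huey is a cat',                          # docid 6
    'ufos over chicago',                      # docid 7
]


def build_demo():
    """Create an in-memory FTS4 table with the default (simple) tokenizer."""
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE VIRTUAL TABLE docs USING fts4(content)')
    conn.executemany('INSERT INTO docs (content) VALUES (?)',
                     [(sample,) for sample in SAMPLES])
    return conn


def match(conn, query):
    """Return the docids of rows matching an FTS query, in docid order."""
    rows = conn.execute(
        'SELECT docid FROM docs WHERE docs MATCH ? ORDER BY docid', (query,))
    return [docid for (docid,) in rows]


conn = build_demo()
prefix_hits = match(conn, 'doc*')                # document, doctor
phrase_hits = match(conn, '"sql* data*"')        # sqlite database, sql datatype
near_hits = match(conn, 'sqlite NEAR/5 search')  # both tokens close together
either_hits = match(conn, 'huey OR ufos')        # rows with either token
```

Because this demo uses the default tokenizer rather than Porter stemming, huey OR ufos only matches the literal tokens huey and ufos.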

A quick note on the name of the project

As folks who follow my blog may have noticed, I like naming my projects after my pets. Scout was our family dog and he was the best dog I've ever known (sorry, Mickey). He was very clever and got into a lot of mischief. One time he ate a bar of decorative soap. Scout was named for the character in To Kill a Mockingbird.

Thanks for reading

Thanks for taking the time to read this post. I hope you found it interesting. SQLite is an amazing library and the full-text search extension works very well. A neat bonus of using SQLite is that our search index is stored in a single, easily transportable file.

If you'd like to learn more about Scout, check out the documentation. The code is available on GitHub and can also be installed using pip.


Comments (5)

Gareth Bale | apr 03 2015, at 03:46pm

Nice Post

Anonymous | mar 31 2015, at 01:43pm

There are more dogs than cats; I am disappointed.

Anonymous | mar 30 2015, at 02:41am

I enjoy reading your blog posts. Thank you

Charlie | mar 29 2015, at 12:48am

Thanks, Evan! I've used Solr in the past for a handful of projects and was really impressed with it. More recently I did some experimenting with ElasticSearch and I find it much more flexible than Solr. My thought with Scout was that for most projects I just need good, reliable full-text search with a sane, readable query format. I also really like trying to come up with fun ways to use SQLite, so this project was quite fun for me. Hope you find it helpful, contact me if you have any questions.

Evan | mar 29 2015, at 12:23am

This is pretty neat! I have actually been doing a lot of work with Solr lately, so it crossed my mind a few times that I should try and build a search index for fun, albeit probably not in Python or Java. I will definitely be taking a closer look at this though to try and glean some insight on the process!
