Saturday morning hack: personalized news digest with boolean query parser

Because I had so much fun writing my last Saturday morning hack, I thought I would share another little project. I really enjoy my subscription to Python Weekly, and I got to thinking: wouldn't it be great to have a personal email digest containing just the types of things that interest me? I regularly cruise reddit and Hacker News, but in my opinion they have a pretty low signal-to-noise ratio. Occasionally I stumble on fascinating content, though, and that's what keeps me coming back.

I wanted to write an app that would search the sites I read and automatically create an email digest based on search terms I specified. I recently swapped my blog over to SQLite, and I love that the SQLite full-text search extension lets you specify boolean queries. With that in mind, I decided to keep a curated list of boolean search queries that would be used to filter content from the various sites I read. Any articles matching my searches would then be emailed to me.

Here are some of my search terms, which I am viewing in the flask-peewee admin interface:

[Screenshot: search terms in the flask-peewee admin]

Implementation

Here is the bill of materials:

- peewee, for defining the data models and storing everything in SQLite
- flask-peewee, which provides the admin interface for managing search terms
- huey, for running the periodic digest task
- the standard library (re, urllib2, json and ElementTree) for parsing and fetching

The first thing I did was define my database models. As you can see, they are very simple:

import datetime

from peewee import *

class BaseModel(Model):
    class Meta:
        database = SqliteDatabase('search_terms.db')

class SearchTerm(BaseModel):
    phrase = CharField()

    def __unicode__(self):
        return self.phrase

    def parse(self):
        # TODO: we will add this method in a bit.
        pass

    def test(self, title):
        # TODO: we will add this method in a bit.
        pass

class SavedLink(BaseModel):
    title = CharField(max_length=255)
    url = CharField(index=True)
    timestamp = DateTimeField(default=datetime.datetime.now)

Parsing boolean expressions

Now we come to the interesting part: the boolean expression parser. If you look closely at the screenshot above, you can see a few of the SearchTerm objects I've specified. For example:

peewee AND (python OR orm)

Search terms consist of phrases joined by conjunctions (AND and OR). A term can be negated by prefixing it with a dash (-) character. Terms can also be nested using parentheses, as in the example above, which will match articles containing the word peewee along with either python or orm.

The first step is to split the search term into one or more tokens, which can be accomplished with a short regex. The regex splits on quoted text, parentheses and whitespace:

'(\".+?\"|\(|\s+|\))'

The parsing code maintains a stack of Node objects which comprise the query tree. The Node class is very basic, and there are two subclasses which are used to represent AND and OR nodes:

class Node(object):
    def __init__(self, text=''):
        self.text = text
        self.children = []

    def __repr__(self):
        if not self.children:
            return self.text
        else:
            return '%s(%s)' % (
                self.text,
                ' '.join(str(node) for node in self.children))

class AndNode(Node):
    pass

class OrNode(Node):
    pass

Putting this together, here is the SearchTerm.parse() method, which splits the phrase into tokens, then builds a tree of Node objects.

def parse(self):
    node_map = {'AND': AndNode, 'OR': OrNode}

    # The default conjunction is AND, so if the phrase simply consists of
    # one or more words, "and"- them together.
    stack = [AndNode('AND')]

    for token in re.split('(\".+?\"|\(|\s+|\))', self.phrase):
        # Ignore any empty tokens.
        token = token.strip()
        if not token:
            continue

        if token == '(':
            stack.append(Node())
        elif token == ')':
            node = stack.pop()
            if len(stack) == 0:
                raise ValueError('Error: unbalanced parentheses!')
            stack[-1].children.append(node)
        elif token in ('AND', 'OR'):
            # In the event the top node uses the same conjunction, just
            # continue on. Otherwise push a new node onto the stack.
            if stack[-1].text != token:
                top = stack.pop()
                node = node_map[token](token)
                if top.text == '' and len(top.children) == 1:
                    node.children.extend(top.children)
                else:
                    node.children.append(top)
                stack.append(node)
        else:
            node = Node(token.strip('"').lower())
            stack[-1].children.append(node)

    if len(stack) != 1:
        raise ValueError('Error: unbalanced parentheses!')

    return stack[0]

If you try this out in the console, you'll get a nice representation of the search terms:

>>> search_term = SearchTerm(phrase='peewee AND (python OR orm)')
>>> search_term.parse()
AND(peewee OR(python orm))
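
Quoted phrases, negation and error handling all work the same way. Here's a quick check using another phrase I made up, along with a deliberately unbalanced one:

>>> SearchTerm(phrase='-javascript AND (python OR "standard library")').parse()
AND(-javascript OR(python standard library))
>>> SearchTerm(phrase='peewee AND (python').parse()
Traceback (most recent call last):
  ...
ValueError: Error: unbalanced parentheses!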

Compiling the search query

Now that we have a query tree, we need to compile it into executable Python code. I struggled with this for a bit, but then I settled on what I hope is a fairly idiomatic implementation -- I added a code() method to the Node classes which recursively generates a Python function that evaluates the query conditions.

For text nodes, the question is simply: does the given title contain the phrase we're interested in? For AND and OR nodes, we just need to know whether all or any of the child nodes match, respectively.

Here are the updated implementations of the Node classes, which now contain a code() method:

class Node(object):
    def __init__(self, text=''):
        self.text = text
        self.children = []

    def __repr__(self):
        if not self.children:
            return self.text
        else:
            return '%s(%s)' % (
                self.text,
                ' '.join(str(node) for node in self.children))

    def code(self):
        if self.text.startswith('-'):
            return lambda s: self.text[1:] not in s
        else:
            return lambda s: self.text in s

class AndNode(Node):
    def code(self):
        def fn(s):
            return all([
                child.code()(s)
                for child in self.children])
        return fn

class OrNode(Node):
    def code(self):
        def fn(s):
            return any([
                child.code()(s)
                for child in self.children])
        return fn
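
To see how these pieces compose, here is a quick sketch that builds a small tree by hand (rather than via parse()) and evaluates the generated function:

>>> node = AndNode('AND')
>>> node.children = [Node('peewee'), Node('-javascript')]
>>> match = node.code()
>>> match('peewee orm for python')
True
>>> match('peewee javascript bindings')
False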

Now that we have a way of converting the query tree into Python code, we can fill in the SearchTerm.test() method, which tests whether a title matches the phrase stored in SearchTerm.phrase:

class SearchTerm(BaseModel):
    # yadda yadda...

    def parse(self):
        # ...
        return stack[0]

    def test(self, s):
        # Parse the search phrase into a `Node` tree, then generate a python
        # function corresponding to the query conditions. Finally, apply that
        # function to the incoming string, `s`.
        return self.parse().code()(s)

Let's test some strings against one of our search terms in the interactive interpreter to see whether they match:

>>> from playground.models import *
>>> st = SearchTerm(phrase='peewee AND (python OR orm)')
>>> st.test('peewee')
False
>>> st.test('peewee is written with python')
True
>>> st.test('an orm named peewee')
True
>>> st.test('an orm written with python')
False
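
Negated terms flow through the same path. Another made-up example:

>>> st = SearchTerm(phrase='python AND -javascript')
>>> st.test('python web framework')
True
>>> st.test('compile python to javascript')
False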

Extracting links from websites

The final piece is the code that extracts links from websites. This code needs to retrieve a list of links (consisting of a title and URL) from a specified URL. The catch is that some sites use RSS, some use a REST API, and in the future I may even want to scrape HTML, so the implementation needed to be flexible enough to handle all cases.

As I mentioned, the code will be extracting Link objects. Links are really just a title and a URL, so I've implemented them using namedtuple:

from collections import namedtuple

Link = namedtuple('Link', ('title', 'url'))
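
The fields can then be accessed by name. For example, with some made-up values:

>>> link = Link(title='peewee 2.0 released', url='http://example.com/peewee')
>>> link.title
'peewee 2.0 released'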

I then created a Searcher class with a single abstract method, extract_links(). There are also helper methods for fetching content over HTTP and for testing extracted links against the given SearchTerm objects.

import urllib2

class Searcher(object):
    user_agent = 'SaturdayMorningHacks/0.1'

    def __init__(self, url):
        self.url = url

    def fetch(self):
        request = urllib2.Request(self.url)
        request.add_header('User-Agent', self.user_agent)
        return urllib2.urlopen(request).read()

    def search(self, search_terms):
        content = self.fetch()
        for link in self.extract_links(content):
            for search_term in search_terms:
                if search_term.test(link.title.lower()):
                    yield link
                    break

    def extract_links(self, content):
        raise NotImplementedError

I needed two implementations of Searcher: one for searching RSS feeds, the other for accessing Reddit's API. While Reddit does have RSS feeds, those links just take you to the comments page for each post, so I opted to use the API, which exposes the actual content URL. For simplicity, I used the standard library's ElementTree to parse the RSS.

import json
from xml.etree import ElementTree as etree

class FeedSearcher(Searcher):
    def extract_links(self, content):
        tree = etree.fromstring(content)
        articles = tree.findall('channel/item')
        for article in articles:
            yield Link(
                title=article.findtext('title'),
                url=article.findtext('link'))


class RedditSearcher(Searcher):
    def extract_links(self, content):
        content = json.loads(content)
        links = content['data']['children']
        for link in links:
            link_data = link['data']
            if link_data['selftext'] == '':
                yield Link(
                    title=link_data['title'],
                    url=link_data['url'])
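
Since extract_links() operates on a plain string, you can sanity-check the RSS parsing without touching the network. Here's an example using a tiny made-up feed:

>>> feed = ('<rss><channel><item>'
...         '<title>peewee 2.0 released</title>'
...         '<link>http://example.com/peewee</link>'
...         '</item></channel></rss>')
>>> list(FeedSearcher('').extract_links(feed))
[Link(title='peewee 2.0 released', url='http://example.com/peewee')]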

Now that I have my Searcher implementations, I can create a list containing searchers for the sites I am interested in:

SEARCHERS = [
    FeedSearcher('https://news.ycombinator.com/rss'),
    RedditSearcher('http://www.reddit.com/r/programming/hot.json'),
    RedditSearcher('http://www.reddit.com/r/Python/hot.json'),
]
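
To preview what the searchers turn up before wiring in the email digest, you can run them by hand in the shell (this hits the network and assumes some SearchTerm rows have been saved):

>>> terms = list(SearchTerm.select())
>>> for searcher in SEARCHERS:
...     for link in searcher.search(terms):
...         print link.title, '->', link.url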

Tying up loose ends

Lastly, we need to set up a huey task to fetch the list of articles and generate an email digest. Huey is a multi-threaded task queue that plays nicely with Redis, but if you're just experimenting you can use the SQLite backend (or simply run a script as a cronjob).
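
The task below assumes a huey instance named huey is importable, and that the flask app and a send_email() helper are defined elsewhere in the project. A minimal sketch of what the huey setup might look like (the queue name is arbitrary):

from huey import RedisHuey, crontab

# Redis-backed huey instance used by the periodic task below.
huey = RedisHuey('article-digest')

The crontab(minute='0', hour='0') schedule runs the task once a day, at midnight: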

@huey.periodic_task(crontab(minute='0', hour='0'))
def content_search():
    logger = app.logger
    query = SearchTerm.select()
    links = []

    for searcher in SEARCHERS:
        try:
            # search() is a generator, so consume it inside the try block to
            # catch any network errors raised while fetching.
            results = list(searcher.search(query))
        except Exception:
            logger.exception('Error fetching %s', searcher.url)
        else:
            for result in results:
                exists = (SavedLink.select()
                          .where(SavedLink.url == result.url).exists())
                if not exists:
                    SavedLink.create(title=result.title, url=result.url)
                    links.append(result)

    if links:
        digest = '\n'.join(
            '"%s"  ->  %s' % (link.title, link.url)
            for link in links)
        send_email(my_email_address, 'Article digest', digest)

And that's all there is to it! While testing I received the first issue of my new article digest, and I was very pleased with the results.

Thanks for reading

Thanks for reading this post, I hope you found it interesting. I had a lot of fun working on this project and am looking forward to tweaking my SearchTerms to achieve a really great stream of new content. Look for more posts in the future tagged saturday-morning-hacks -- I'm really liking this format.

If you're interested in more projects like this, check out the saturday-morning hack posts.

Comments (3)

Mike | May 12, 2014, at 09:02am

Just wanted to say that I've really been enjoying this series. Great post, looking forward to the next one!

Charles Leifer | May 11, 2014, at 09:59pm

Thanks! I've integrated these features into a personal playground site (here's a post about it), which is set up to be deployed with fabric. I'm running on AWS, so I just have fabric commands to pull down the code, update pip requirements, and restart the server and task queue processes (which I manage with supervisor).

Ben Hughes | May 11, 2014, at 06:01am

Love this series Charles - on this, what tools (if any) do you use to run/deploy these quick tools onto a server?
