Saturday morning hacks: Building an Analytics App with Flask


A couple years back I wrote about building an Analytics service with Cassandra. As fun as that project was to build, the reality was that Cassandra was completely unsuitable for my actual needs, so I decided to switch to something simpler. I'm happy to say the replacement app has been running without a hitch for the past 5 months taking up only about 20 MB of RAM! In this post I'll show how to build a lightweight Analytics service using Flask.

Analytics request/response cycle

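The analytics service we'll be building will follow a blueprint popularized by Google Analytics. Here's how it works: each page you want to track includes a small JavaScript file served by the analytics app. When the script runs in the visitor's browser, it collects a few details about the page and requests a 1-pixel GIF from the analytics app, passing those details along in the query-string. The app records the page-view and responds with the GIF.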

Here is a diagram of the requests and responses:

[Figure: Analytics Flow]

Design considerations

Since this is running on a VPS with limited resources, and because my blog doesn't really receive that much traffic, we'll go with something lightweight and functional. I like the Flask framework for projects of all sizes, and it should work really well for this particular app. We'll also use the peewee ORM for storing the page-views and, later in this post, for running queries against our analytics data. All told, our app will be less than 100 lines of code including comments!

Relational Database

In order to be able to easily run lots of ad-hoc queries, we'll use a relational database to store the page-view data. I chose to use BerkeleyDB's SQLite interface because it is a lightweight embedded database and won't take up too much RAM. I could have also chosen SQLite, but BerkeleyDB provides much higher concurrent transaction throughput than SQLite does, so in the event the analytics site gets hammered, it should be able to keep up with the traffic.

If you're already running PostgreSQL or MySQL, then feel free to use them instead.

WSGI Server

There are a lot of options to choose from, but my preference is to use gevent. Gevent is a coroutine-based networking library that mixes lightweight threads (greenlets) with libev's event loop. Through some pretty deep monkey-patching, gevent turns your normal, blocking Python code into non-blocking code without any special syntax or APIs (just one big monkeypatch). Gevent's WSGI server, while pretty basic, provides solid performance with very low overhead. As with the database, if you're already running something else or are familiar with a different library, feel free to use that instead.

Creating the virtualenv

Begin by creating a new virtualenv for the analytics app and installing flask and peewee (and optionally, gevent):

$ virtualenv analytics
New python executable in analytics/bin/python2
Also creating executable in analytics/bin/python
Installing setuptools, pip...done.

$ cd analytics/
$ source bin/activate
$ pip install flask peewee
...
...
Successfully installed flask peewee Werkzeug Jinja2 itsdangerous markupsafe
Cleaning up...

$ pip install gevent  #  Optional.

If you'd like to compile the Python SQLite driver with support for BerkeleyDB, check out the berkeley_build.sh script in the playhouse module (lib/python2.7/site-packages/playhouse/berkeley_build.sh). This script will fetch and compile BerkeleyDB, then compile pysqlite against the BerkeleyDB sqlite libraries. For more detailed instructions, check out this post. You can also skip this step and just use peewee's standard SqliteDatabase class.

Implementing the Flask App

Let's start by creating the skeleton of our Flask app. As discussed, there will be two views: one to serve the JavaScript file, and one to serve the 1-pixel GIF. In the analytics directory, create a new file analytics.py and add the following lines of code. This code specifies the boilerplate for our application as well as some configuration values:

from base64 import b64decode
import datetime
import json
import os
from urlparse import parse_qsl, urlparse

from flask import Flask, Response, abort, request
from peewee import *
from playhouse.berkeleydb import BerkeleyDatabase  # Optional.


# 1 pixel GIF, base64-encoded.
BEACON = b64decode('R0lGODlhAQABAIAAANvf7wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==')

# Store the database file in the app directory.
APP_DIR = os.path.dirname(__file__)
DATABASE_NAME = os.path.join(APP_DIR, 'analytics.db')
DOMAIN = 'http://127.0.0.1:5000'  # TODO: change me.

# Simple JavaScript which will be included and executed on the client-side.
JAVASCRIPT = '' # TODO: add javascript implementation.

# Flask application settings.
DEBUG = bool(os.environ.get('DEBUG'))
SECRET_KEY = 'secret - change me'  # TODO: change me.

app = Flask(__name__)
app.config.from_object(__name__)

database = BerkeleyDatabase(DATABASE_NAME)  # or SqliteDatabase(DATABASE_NAME)

class PageView(Model):
    # TODO: add model definition.

    class Meta:
        database = database

@app.route('/a.gif')
def analyze():
    pass  # TODO: implement 1pixel gif view.

@app.route('/a.js')
def script():
    pass  # TODO: implement javascript view.

@app.errorhandler(404)
def not_found(e):
    # Keep the 404 status code; a bare Response would be sent as 200 OK.
    return Response('Not found.', status=404)

if __name__ == '__main__':
    database.create_tables([PageView], safe=True)
    app.run()

As you can see we have two simple views: one to serve the JavaScript, and one to analyze the data sent back from the visitor's browser and serve a 1-pixel GIF.
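As a quick sanity check on the BEACON constant, the following sketch decodes it with the standard library and inspects the GIF header. It assumes the usual GIF89a layout (a 6-byte signature followed by the width and height as little-endian 16-bit integers), and the variable names are only illustrative.

```python
import struct
from base64 import b64decode

# Same base64 string as the BEACON constant in analytics.py.
beacon = b64decode(
    'R0lGODlhAQABAIAAANvf7wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==')

# A GIF file starts with a 6-byte signature, followed by the logical
# screen width and height as little-endian unsigned 16-bit integers.
signature = beacon[:6]
width, height = struct.unpack('<HH', beacon[6:10])

print(signature, width, height)
```

If the constant was copied correctly, the signature should read GIF89a and both dimensions should be 1.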

Getting information from the Browser

Let's begin with the JavaScript that will run on the client-side. This code will extract some basic information from the page: its URL, its title, and the referring page (if any).

There are other attributes we could also extract, which you can add if you're interested.

After extracting the information, we will pass it to the analyze view in the query-string. For simplicity, we will have the JavaScript execute immediately once it is loaded by the visitor's browser, so we will wrap everything in a self-invoking anonymous function. Finally, we will use the browser's encodeURIComponent function to make values safe for passing through the query-string:

(function() {
  var img = new Image,
      url = encodeURIComponent(document.location.href),
      title = encodeURIComponent(document.title),
      ref = encodeURIComponent(document.referrer);
  img.src = '%s/a.gif?url=' + url + '&t=' + title + '&ref=' + ref;
})();

We've left a placeholder using the Python string interpolation parameter %s to allow our app to pass in the DOMAIN configuration value.
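To see why the encodeURIComponent step matters, here is a small Python sketch of what the server would see with and without encoding. It uses quote(..., safe='') as a rough stand-in for encodeURIComponent (the two do not escape exactly the same set of characters), and the page URL is made up for the example.

```python
try:
    from urllib.parse import parse_qs, quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2, as used in this post
    from urlparse import parse_qs

# Made-up page URL that carries its own query-string.
page_url = 'http://example.com/search/?q=flask&page=2'

# Without encoding, '&page=2' leaks into the beacon's own parameters
# and the url value is cut short.
naive = parse_qs('url=' + page_url + '&t=Search')

# With percent-encoding, the whole page URL survives as a single value.
encoded = parse_qs('url=' + quote(page_url, safe='') + '&t=Search')

print(naive['url'])
print(encoded['url'])
```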

Replace the JAVASCRIPT configuration value in your application file with the following "minified" version of the above JavaScript code:

# Simple JavaScript which will be included and executed on the client-side.
JAVASCRIPT = """(function(){
    var d=document,i=new Image,e=encodeURIComponent;
    i.src='%s/a.gif?url='+e(d.location.href)+'&ref='+e(d.referrer)+'&t='+e(d.title);
    })()""".replace('\n', '')

We can now fill in the script view to serve our JavaScript file:

@app.route('/a.js')
def script():
    return Response(
        app.config['JAVASCRIPT'] % (app.config['DOMAIN']),
        mimetype='text/javascript')
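
To see what the browser will actually receive, here is a standalone sketch of the same template interpolation with no Flask involved. The domain is a made-up placeholder standing in for the DOMAIN configuration value. One thing to keep in mind with %-formatting is that any literal percent sign added to the script later would need to be written as %%.

```python
# Standalone copy of the minified template from analytics.py.
JAVASCRIPT = """(function(){
    var d=document,i=new Image,e=encodeURIComponent;
    i.src='%s/a.gif?url='+e(d.location.href)+'&ref='+e(d.referrer)+'&t='+e(d.title);
    })()""".replace('\n', '')

# Hypothetical domain, standing in for the DOMAIN configuration value.
domain = 'http://analytics.example.com'

rendered = JAVASCRIPT % domain
print(rendered)
```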

Storing the page-view data

The script we wrote will send three values to the analyze view, containing the page's URL, title, and referring page. We can now fill in the PageView model definition to store this data.

On the server-side, we will also be able to access the visitor's IP address and the request headers sent by the visitor's browser, so we will add columns for those values as well as the timestamp indicating when the request was made.

Since each browser may send a different collection of headers, and each page may have a different set of query-string parameters, we will store these as JSON in a TextField. If you're using PostgreSQL, you could also use HStore or the native JSON data-type.

Here is the definition of the PageView model, along with a simple JSONField suitable for storing the query-string parameters and request headers:

class JSONField(TextField):
    """Store JSON data in a TextField."""
    def python_value(self, value):
        if value is not None:
            return json.loads(value)

    def db_value(self, value):
        if value is not None:
            return json.dumps(value)

class PageView(Model):
    domain = CharField()
    url = TextField()
    timestamp = DateTimeField(default=datetime.datetime.now, index=True)
    title = TextField(default='')
    ip = CharField(default='')
    referrer = TextField(default='')
    headers = JSONField()
    params = JSONField()

    class Meta:
        database = database
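
The idea behind JSONField can be tried out without peewee. The sketch below uses the standard library's sqlite3 and json modules with a throwaway in-memory table (the table and column names are made up for illustration). Dictionaries are serialized on the way into a TEXT column and deserialized on the way out, which is roughly what db_value and python_value do.

```python
import json
import sqlite3

conn = sqlite3.connect(':memory:')
# Illustrative table only; peewee generates the real schema for PageView.
conn.execute('CREATE TABLE demo (id INTEGER PRIMARY KEY, headers TEXT)')

headers = {'User-Agent': 'ExampleBrowser/1.0', 'Accept': '*/*'}

# Serialize to a JSON string before storing (the db_value direction).
conn.execute('INSERT INTO demo (headers) VALUES (?)', (json.dumps(headers),))

# Deserialize when reading the row back (the python_value direction).
raw, = conn.execute('SELECT headers FROM demo').fetchone()
restored = json.loads(raw)
print(restored['User-Agent'])
```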

Now we can add a method to the PageView model which will extract all the relevant values from the request. The urlparse module contains helpful functions for extracting portions of the request, and we will use this to extract the visitor's URL and the querystring parameters:

class PageView(Model):
    # ... field definitions ...

    @classmethod
    def create_from_request(cls):
        parsed = urlparse(request.args['url'])
        params = dict(parse_qsl(parsed.query))

        return cls.create(
            domain=parsed.netloc,
            url=parsed.path,
            title=request.args.get('t') or '',
            ip=request.headers.get('X-Forwarded-For', request.remote_addr),
            referrer=request.args.get('ref') or '',
            headers=dict(request.headers),
            params=params)
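
To make the parsing step concrete, here is a small standalone sketch of how urlparse and parse_qsl take a URL apart. The URL is invented for the example, and the import is written so that it also runs on Python 3, where these functions live in urllib.parse.

```python
try:
    from urllib.parse import parse_qsl, urlparse  # Python 3
except ImportError:
    from urlparse import parse_qsl, urlparse  # Python 2, as used in this post

# Made-up page URL, as it would arrive (already decoded) in request.args['url'].
page_url = 'http://example.com/blog/some-entry/?page=2&q=flask'

parsed = urlparse(page_url)
params = dict(parse_qsl(parsed.query))

print(parsed.netloc)  # stored in PageView.domain
print(parsed.path)    # stored in PageView.url
print(params)         # stored in PageView.params
```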

The final step will be to fill in the analyze view. This view will create a new PageView and return a 1-pixel GIF. As a safeguard, we will check for the presence of a URL in the querystring to ensure we don't accidentally create blank rows:

@app.route('/a.gif')
def analyze():
    if not request.args.get('url'):
        abort(404)

    with database.transaction():
        PageView.create_from_request()

    response = Response(app.config['BEACON'], mimetype='image/gif')
    response.headers['Cache-Control'] = 'private, no-cache'
    return response

Running the app

If you'd like to test out the app at this point, you can run it in debug mode by specifying DEBUG=1 on the command-line:

(analytics) $ DEBUG=1 python analytics.py
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
 * Restarting with reloader

You can view the JavaScript by loading http://127.0.0.1:5000/a.js . If you have another web-app running locally, you can add the following tag to one of the pages to test the analytics app:

<script src="http://127.0.0.1:5000/a.js" type="text/javascript"></script>

To deploy the app to a production environment, I'd suggest looking into using a dedicated WSGI server. I like using gevent because it is extremely lightweight and provides great performance. You can modify analytics.py to serve requests using gevent instead of the Flask server. The following code will run the analytics app on port 5000 using gevent:

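if __name__ == '__main__':
    # gevent.pywsgi is the module to import from in current gevent releases;
    # the older gevent.wsgi module has since been removed.
    from gevent.pywsgi import WSGIServer
    database.create_tables([PageView], safe=True)
    WSGIServer(('', 5000), app).serve_forever()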

Because gevent uses monkey-patching to achieve its high concurrency, it is necessary to add the following line to the very top of the analytics.py file:

from gevent import monkey; monkey.patch_all()

Querying the Data

The real fun begins after you've started to collect data for a couple days and can run queries on it. In this section we'll look at some interesting ways we can query the data collected by the analytics app.

Using data from my blog, we'll run some queries on the past seven days of traffic:

>>> from analytics import *
>>> import datetime
>>> week_ago = datetime.date.today() - datetime.timedelta(days=7)
>>> base = PageView.select().where(PageView.timestamp >= week_ago)

First off, let's see how many page-views I got during the past week:

>>> base.count()
1133

How many different IPs visited my site?

>>> base.select(PageView.ip).group_by(PageView.ip).count()
850
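
In SQL terms, this counts the groups produced by grouping on the IP column, which gives the same number as counting distinct values directly. Here is a rough sketch using the standard library's sqlite3 module and an invented in-memory table, just to show the two equivalent forms. The table name and sample addresses are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Illustrative stand-in for the table peewee creates for PageView.
conn.execute('CREATE TABLE pageview (id INTEGER PRIMARY KEY, ip TEXT)')
conn.executemany(
    'INSERT INTO pageview (ip) VALUES (?)',
    [('10.0.0.1',), ('10.0.0.2',), ('10.0.0.1',), ('10.0.0.3',)])

# Counting the groups produced by GROUP BY ip ...
grouped, = conn.execute(
    'SELECT COUNT(*) FROM (SELECT ip FROM pageview GROUP BY ip)').fetchone()

# ... gives the same answer as counting distinct values directly.
distinct, = conn.execute(
    'SELECT COUNT(DISTINCT ip) FROM pageview').fetchone()

print(grouped, distinct)
```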

What are the top 10 pages?

print (base
       .select(PageView.title, fn.Count(PageView.id))
       .group_by(PageView.title)
       .order_by(fn.Count(PageView.id).desc())
       .tuples())[:10]

# Prints...
[('Postgresql HStore, JSON data-type and Arrays with Peewee ORM',
  88),
 ("Describing Relationships: Django's ManyToMany Through",
  73),
 ('Using python and k-means to find the dominant colors in images',
  66),
 ('SQLite: Small. Fast. Reliable. Choose any three.', 58),
 ('Using python to generate awesome linux desktop themes',
  54),
 ("Don't sweat the small stuff - use flask blueprints", 51),
 ('Using SQLite Full-Text Search with Python', 48),
 ('Home', 47),
 ('Blog Entries', 46),
 ('Django Patterns: Model Inheritance', 44)]

During what four hour period of the day do I receive the most traffic?

hour = fn.date_part('hour', PageView.timestamp) / 4
id_count = fn.Count(PageView.id)
print (base
       .select(hour, id_count)
       .group_by(hour)
       .order_by(id_count.desc())
       .tuples())[:]

# Prints...
[(3, 208),
 (2, 201),
 (0, 194),
 (1, 183),
 (4, 178),
 (5, 169)]

Based on these numbers, it looks like I get most of my traffic mid-day around lunch-time, and the least amount of traffic in the late evening before midnight, but overall traffic is fairly even.
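
The bucket numbers in that output are the hour of the day divided by four using integer division, so each bucket covers a four-hour window. The following sketch reproduces the arithmetic in plain Python. The helper name bucket_label is made up for this example.

```python
def bucket_label(bucket):
    """Return the time window covered by a four-hour bucket (0 through 5)."""
    start = bucket * 4
    return '%02d:00-%02d:59' % (start, start + 3)

# Map every hour of the day onto its bucket, mirroring hour / 4 in the query.
buckets = {}
for hour in range(24):
    buckets.setdefault(hour // 4, []).append(hour)

for bucket in sorted(buckets):
    print(bucket, bucket_label(bucket), buckets[bucket])
```

Bucket 3 therefore corresponds to 12:00-15:59 and bucket 5 to 20:00-23:59, which matches the reading of the results above.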

What are some of the most popular user-agents?

from collections import Counter
c = Counter(pv.headers.get('User-Agent') for pv in base)
print c.most_common(5)

# Prints...
[(u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36',
  81),
 (u'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36',
  70),
 (u'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:32.0) Gecko/20100101 Firefox/32.0',
  50),
 (u'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2',
  37),
 (u'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0',
  37)]

It's basically up to you what you want to do with the data. One fun query will generate a list of all the pages, in order, that were visited by a particular IP address. This can shed some light on how people browse your site from page-to-page:

inner = base.select(PageView.ip, PageView.url).order_by(PageView.timestamp)
query = (PageView
         .select(PageView.ip, fn.GROUP_CONCAT(PageView.url).alias('urls'))
         .from_(inner.alias('t1'))
         .group_by(PageView.ip)
         .order_by(fn.Count(PageView.url).desc()))
print {pv.ip: pv.urls.split(',') for pv in query[:10]}

# Prints something like the following:
{
  u'xxx.xxx.xxx.xxx': [
    u'/blog/peewee-was-baroque-so-i-rewrote-it/',
    u'/blog/peewee-was-baroque-so-i-rewrote-it/',
    u'/blog/',
    u'/blog/postgresql-hstore-json-data-type-and-arrays-with-peewee-orm/',
    u'/blog/search/',
    u'/blog/the-search-for-the-missing-link-what-lies-between-sql-and-django-s-orm-/',
    u'/blog/how-do-you-use-peewee-/'],
  u'xxx.xxx.xxx.xxx': [
    u'/blog/dont-sweat-small-stuff-use-flask-blueprints/',
    u'/',
    u'/blog/',
    u'/blog/migrating-to-sqlite/',
    u'/blog/',
    u'/blog/saturday-morning-hacks-revisiting-the-notes-app/'],
  u'xxx.xxx.xxx.xxx': [
    u'/blog/using-python-to-generate-awesome-linux-desktop-themes/',
    u'/',
    u'/blog/',
    u'/blog/customizing-google-chrome-s-new-tab-page/',
    u'/blog/-wallfix-using-python-to-set-my-wallpaper/',
    u'/blog/simple-botnet-written-python/'],
  # etc...
}
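
One caveat with the GROUP_CONCAT approach is that splitting on commas will mis-split any URL that itself contains a comma, and as far as I know SQLite does not promise a particular order for the concatenated values. As an alternative sketch, the same per-visitor paths can be assembled in Python from rows sorted by timestamp. The rows below are invented sample data, written as (ip, url, timestamp) tuples rather than PageView instances.

```python
from collections import defaultdict

# Invented sample rows: (ip, url, timestamp). With peewee these values would
# come from a query such as base.order_by(PageView.timestamp).
rows = [
    ('10.0.0.1', '/blog/', 3),
    ('10.0.0.2', '/', 1),
    ('10.0.0.1', '/', 2),
    ('10.0.0.1', '/blog/a,b/', 4),  # a comma in the URL is handled fine
]

paths = defaultdict(list)
for ip, url, timestamp in sorted(rows, key=lambda row: row[2]):
    paths[ip].append(url)

# Visitors with the longest paths first, like the ORDER BY in the query.
top = sorted(paths.items(), key=lambda item: len(item[1]), reverse=True)
print(top[:10])
```

The trade-off is that every row is pulled into Python instead of being aggregated by the database, which should be fine at this blog's traffic levels but may not be for a busier site.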

Ideas for improving the app

Thanks for reading

Thanks for taking the time to read this post. I hope you found it interesting! Feel free to leave a comment below or contact me if you have any questions.

You can find the source code for the analytics app and the "reports" hosted in this GitHub gist.

If you enjoyed this post and are looking for more projects like this, check out the list of saturday-morning hack posts.

Comments (5)

Charles | oct 04 2014, at 01:37pm

Thanks Hendra, I'm actually hacking right now on some improvements to the analytics app (including a nice little dashboard)!

The reason the <img> tag is used is because typically you cannot make cross-domain Ajax requests. I run my analytics service on a different domain, and I also use the analytics service to collect data for multiple sites, so Ajax wasn't an option for me.

Hendra | oct 04 2014, at 01:34pm

Awesome read. Really got me into the weekend morning hacks mood.

A quick question, why make the GET request via the img element instead of just calling it directly via ajax call on the js?

Mark | oct 02 2014, at 09:18pm

Very informative post.

Charles | oct 02 2014, at 02:48pm

Thanks so much for the kind words, Marko. I really like the simplicity and low-overhead of using gevent. I also use gevent for non-blocking I/O, so since I'm already using it I might as well use it for my server. I have a single-core server to serve my sites, so there's no real need for me to go with something bigger. I don't have a lot of experience with CherryPy but I hear it's pretty fantastic, so you should be fine either way.

Marko | oct 02 2014, at 02:43pm

As always it is a great pleasure to read your simple yet awesome and effective code. The implementation is great and I'm thinking about reusing some of it for a mobile flask-based geolocation app. What got me thinking was your decision to use gevent to serve the flask app. I am working on a small-to-medium office app in flask and I thought of deploying it (on windows) with the cherrypy wrapper. Do you think gevent is a better solution?

