November 07, 2013 06:19 / 3 comments / peewee python

I recently heard a talk from a coworker wherein one of the things he discussed was automatically converting CSV data for use with a SQLite database. I thought this would be a great thing to add to peewee, especially as lately I've found myself on several occasions working with CSV and battling with it in a spreadsheet. It would be much easier to load it into a database and then query it using a tool I'm familiar with.

Which brings me to playhouse.csv_loader, a new module I've added to the playhouse package of extras. It's hopefully really easy to use. Here is an example of how you might use it:

>>> from playhouse.csv_loader import *
>>> db = SqliteDatabase(':memory:')  # Create an in-memory sqlite database

# Load the CSV file into the in-memory database and return a Model suitable
# for querying the data.
>>> ZipToTZ = load_csv(db, 'zipcode_to_timezone.csv')

# Get the timezone for a zipcode.
>>> ZipToTZ.get(ZipToTZ.zip == 66047).timezone
'US/Central'

# Get all the zipcodes for my town.
>>> [row.zip for row in ZipToTZ.select().where(
...     (ZipToTZ.city == 'Lawrence') && (ZipToTZ.state == 'KS'))]
[66044, 66045, 66046, 66047, 66049]

Exploring Reddit data

Chris Dary crawled reddit and grabbed the top 1,000 posts from the top 2,500 subreddits by subscriber and dumped them into CSV files. The data can be obtained on his GitHub page: https://github.com/umbrae/reddit-top-2.5-million

$ git clone https://github.com/umbrae/reddit-top-2.5-million reddit-data

Let's load up all the files and run some queries! All the files are about 1.6GB of CSV data (~2.3M rows) so it takes a few minutes to load. I'm passing in a parameter sample_size=0 so that the introspector doesn't try to "introspect" all 1000 files. I'm also specifying db_table='reddit' so that all the CSV files are read into the same table (otherwise they would be read into tables based on their filename). Immediately after loading the data, I'm going to print out all the field names so we know what we're looking at:

>>> db = SqliteDatabase(':memory:')
>>> import glob
>>> for csv_file in glob.glob('reddit-data/data/*.csv'):
...     print 'Loading', csv_file
...     R = load_csv(db, csv_file, sample_size=0, db_table='reddit')
>>> R._meta.get_field_names()  # What are the columns?
['_auto_pk',
 'created_utc',
 'score',
 'domain',
 'id',
 'title',
 'author',
 'ups',
 'downs',
 'num_comments',
 'permalink',
 'selftext',
 'link_flair_text',
 'over_18',
 'thumbnail',
 'subreddit_id',
 'edited',
 'link_flair_css_class',
 'author_flair_css_class',
 'is_self',
 'name',
 'url',
 'distinguished']

Here are some example queries:

>>> R.select(fn.Avg(R.ups / R.downs)).scalar()  # What is the average ratio of upvotes to downvotes?
7.438

>>> R.select(fn.Avg(R.num_comments)).scalar()  # Average number of comments?
42.189

>>> list(R
...     .select(R.domain, fn.Count(R.id))
...     .group_by(R.domain)
...     .order_by(fn.Count(R.id).desc())
...     .limit(10)
...     .tuples())  # Top 10 domains for links?
[(u'i.imgur.com', 478656),
 (u'imgur.com', 295313),
 (u'youtube.com', 172186),
 (u'reddit.com', 25445),
 (u'flickr.com', 17854),
 (u'youtu.be', 16340),
 (u'soundcloud.com', 10397),
 (u'quickmeme.com', 9561),
 (u'i.minus.com', 8549),
 (u'twitter.com', 7611)]

As I hope you see, it is very easy to start exploring the data if you're familiar with SQL. If you're not very familiar with SQL, there are tons of resources online for learning.

Check it out

This is a very new feature so there may be some rough edges, but I hope that this may help your work with CSV go more smoothly. For details check out the documentation.

If you're new to peewee, I hope this has piqued your interest! There are a lot of interesting free data-sets out there, here are some links to get you started:

Happy hacking, please feel free to leave any questions or comments below! If you're interested in learning more about peewee check the docs: http://docs.peewee-orm.com/

Lastly, if you're really interested in exploring data with python, I would highly suggest looking into the pandas library. I'd also suggest iPython notebook, which is a web frontend for iPython (the improved python shell) that has integrated support for matplotlib and markdown-formatted text.

Thanks for reading! As a reward for making it this far, here's a picture of my cat Huey:

Huey nap

Comments (3)

robert | nov 07 2013, at 10:33pm

what's playhouse ?

Charlie Leifer | nov 08 2013, at 07:02am

Playhouse is a collection of extensions to the peewee orm. It comes with the library, so its not a separate install.

Docs


Commenting has been closed, but please feel free to contact me