I recently heard a talk from a coworker wherein one of the things he discussed was automatically converting CSV data for use with a SQLite database. I thought this would be a great thing to add to peewee, especially as lately I've found myself on several occasions working with CSV and battling with it in a spreadsheet. It would be much easier to load it into a database and then query it using a tool I'm familiar with.
Which brings me to playhouse.csv_loader, a new module I've added to the playhouse package of extras. Hopefully it's really easy to use. Here is an example of how you might use it:
>>> from playhouse.csv_loader import *
>>> db = SqliteDatabase(':memory:')  # Create an in-memory sqlite database.

# Load the CSV file into the in-memory database and return a Model suitable
# for querying the data.
>>> ZipToTZ = load_csv(db, 'zipcode_to_timezone.csv')

# Get the timezone for a zipcode.
>>> ZipToTZ.get(ZipToTZ.zip == 66047).timezone
'US/Central'

# Get all the zipcodes for my town.
>>> [row.zip for row in ZipToTZ.select().where(
...     (ZipToTZ.city == 'Lawrence') & (ZipToTZ.state == 'KS'))]
[66044, 66045, 66046, 66047, 66049]
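If you're curious what's happening under the hood, the basic idea is simple enough to sketch with just the standard library: read the CSV, create a table with one column per header field, and insert the rows. This is a simplified illustration, not peewee's actual implementation, and the table name and columns here are made up to mirror the example above:

```python
import csv
import io
import sqlite3

# Stand-in for a CSV file on disk; normally you'd use open('file.csv').
csv_data = io.StringIO(
    "zip,city,state,timezone\n"
    "66047,Lawrence,KS,US/Central\n")

conn = sqlite3.connect(':memory:')
reader = csv.reader(csv_data)
header = next(reader)  # First row is assumed to hold the column names.

# Create a table whose columns match the CSV header (untyped, for simplicity).
conn.execute('CREATE TABLE zip_to_tz (%s)' % ', '.join(header))
conn.executemany(
    'INSERT INTO zip_to_tz VALUES (%s)' % ', '.join('?' * len(header)),
    reader)

print(conn.execute(
    "SELECT timezone FROM zip_to_tz WHERE zip = '66047'").fetchone()[0])
```

load_csv takes care of the parts this glosses over, like guessing sensible column types and returning a peewee Model you can query.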
Chris Dary crawled reddit, grabbed the top 1,000 posts from each of the top 2,500 subreddits by subscriber count, and dumped them into CSV files. The data can be obtained on his GitHub page: https://github.com/umbrae/reddit-top-2.5-million
$ git clone https://github.com/umbrae/reddit-top-2.5-million reddit-data
Let's load up all the files and run some queries! Altogether the files contain about 1.6GB of CSV data (~2.3M rows), so loading takes a few minutes. I'm passing in sample_size=0 so that the introspector doesn't try to "introspect" all of the files. I'm also specifying db_table='reddit' so that all the CSV files are read into the same table (otherwise each would be read into a table named after its filename). Immediately after loading the data, I'm going to print out all the field names so we know what we're looking at:
>>> db = SqliteDatabase(':memory:')
>>> import glob
>>> for csv_file in glob.glob('reddit-data/data/*.csv'):
...     print 'Loading', csv_file
...     R = load_csv(db, csv_file, sample_size=0, db_table='reddit')

>>> R._meta.get_field_names()  # What are the columns?
['_auto_pk', 'created_utc', 'score', 'domain', 'id', 'title', 'author',
 'ups', 'downs', 'num_comments', 'permalink', 'selftext', 'link_flair_text',
 'over_18', 'thumbnail', 'subreddit_id', 'edited', 'link_flair_css_class',
 'author_flair_css_class', 'is_self', 'name', 'url', 'distinguished']
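To give a feel for what sample_size controls: the loader peeks at a handful of rows to guess a type for each column, and sample_size=0 skips that step. Here's a hypothetical, simplified sketch of that kind of introspection — not peewee's actual code, just the general idea:

```python
import csv
import io

def guess_column_types(csv_text, sample_size=10):
    """Guess a SQLite type for each column by examining up to sample_size
    data rows. A simplified sketch of CSV type introspection, not
    peewee's actual implementation."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    types = ['INTEGER'] * len(header)  # Start narrow, widen as values demand.
    for i, row in enumerate(reader):
        if i >= sample_size:
            break
        for col, value in enumerate(row):
            if types[col] == 'INTEGER':
                try:
                    int(value)
                except ValueError:
                    types[col] = 'TEXT'  # Non-numeric value: widen to TEXT.
    return dict(zip(header, types))

print(guess_column_types("zip,city\n66047,Lawrence\n66044,Lawrence\n"))
# With sample_size=0 the loop is skipped and every column keeps its default.
```

With 1,000 files to process, skipping that per-file sampling pass saves a noticeable amount of time.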
Here are some example queries:
# What is the average ratio of upvotes to downvotes?
>>> R.select(fn.Avg(R.ups / R.downs)).scalar()
7.438

# Average number of comments?
>>> R.select(fn.Avg(R.num_comments)).scalar()
42.189

# Top 10 domains for links?
>>> list(R
...      .select(R.domain, fn.Count(R.id))
...      .group_by(R.domain)
...      .order_by(fn.Count(R.id).desc())
...      .limit(10)
...      .tuples())
[(u'i.imgur.com', 478656), (u'imgur.com', 295313), (u'youtube.com', 172186),
 (u'reddit.com', 25445), (u'flickr.com', 17854), (u'youtu.be', 16340),
 (u'soundcloud.com', 10397), (u'quickmeme.com', 9561), (u'i.minus.com', 8549),
 (u'twitter.com', 7611)]
As I hope you can see, it's very easy to start exploring the data if you're familiar with SQL. If you're not, there are tons of resources online for learning.
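If it helps to see past the peewee syntax, each of those queries corresponds to ordinary SQL. For instance, the top-domains query maps roughly to the SELECT below — run here against a tiny made-up table with the stdlib sqlite3 module so the example is self-contained:

```python
import sqlite3

# A tiny stand-in for the 'reddit' table, invented for illustration.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE reddit (id INTEGER, domain TEXT)')
conn.executemany('INSERT INTO reddit VALUES (?, ?)', [
    (1, 'i.imgur.com'), (2, 'i.imgur.com'), (3, 'youtube.com'),
    (4, 'i.imgur.com'), (5, 'youtube.com'), (6, 'reddit.com')])

# Roughly the SQL the peewee top-10-domains query builds.
top_domains = conn.execute("""
    SELECT domain, COUNT(id)
    FROM reddit
    GROUP BY domain
    ORDER BY COUNT(id) DESC
    LIMIT 10""").fetchall()
print(top_domains)
# -> [('i.imgur.com', 3), ('youtube.com', 2), ('reddit.com', 1)]
```

The nice part of having a Model is that peewee writes this SQL for you and hands back rows as Python objects.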
This is a very new feature, so there may be some rough edges, but I hope it helps make your work with CSV go more smoothly. For details, check out the documentation.
If you're new to peewee, I hope this has piqued your interest! There are a lot of interesting free data-sets out there; here are some links to get you started:
Lastly, if you're really interested in exploring data with Python, I would highly suggest looking into the pandas library. I'd also suggest IPython Notebook, a web frontend for IPython (the improved Python shell) with integrated support for matplotlib and markdown-formatted text.
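As a quick taste of pandas, here is the same kind of group-by exploration we did above with peewee. The CSV content is invented for illustration; normally you'd pass read_csv a filename:

```python
import io
import pandas as pd

# Stand-in CSV data; with a real file you'd call pd.read_csv('posts.csv').
csv_data = io.StringIO(
    "domain,num_comments\n"
    "i.imgur.com,10\n"
    "youtube.com,50\n"
    "i.imgur.com,30\n")

df = pd.read_csv(csv_data)

# Average comment count per domain, analogous to a GROUP BY query.
print(df.groupby('domain')['num_comments'].mean())
```

Where SQL shines for querying, pandas shines for the analysis that comes after: reshaping, statistics, and plotting.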
Thanks for reading! As a reward for making it this far, here's a picture of my cat Huey: