Peewee, a lightweight Python ORM - Original Post

Edit

For the past month or so I've been working on writing my own ORM in Python. The project grew out of a need for a lightweight persistence layer for use in Flask web apps. As I've grown so familiar with the Django ORM over the past year, many of the ideas in Peewee are analagous to the concepts in Django. My goal from the beginning has been to keep the implementation simple without sacrificing functionality, and to ultimately create something hackable that others might be able to read and contribute to.

Weighing in at about 1000 lines of code, Peewee doesn't come close to matching Django's ORM (15K LOC) in terms of API cleanliness or functionality, but it does hit many of the basic use-cases for an app that needs lightweight persistence and querying. This has definitely been one of the most rewarding projects I've worked on!

Benchmarks

In terms of speed, peewee is generally 25% faster than django when creating rows or grabbing simple lists of objects. Peewee is 77% faster than Django for simple one-row calls to ".get()", and almost 50% faster when doing ".get_or_create()". When doing a single join, peewee is only 23% faster, but when executing a comparable query using a subquery instead of a join, peewee is almost 60% faster than django!

.                              | django_bench | peewee_bench | djang diff |
                 test_creation |    11.441360 |     9.212356 | 19.481986% |
           test_get_user_count |     0.048042 |     0.023086 | 51.946383% |
               test_list_users |     2.612983 |     2.037555 | 22.021881% |
       test_list_users_ordered |     3.022387 |     2.397535 | 20.674121% |
                 test_get_user |     0.232575 |     0.053378 | 77.049199% |
            test_get_or_create |     0.789696 |     0.406243 | 48.557027% |
      test_list_blogs_for_user |     1.423771 |     0.646669 | 54.580533% |
    test_list_entries_for_user |     1.684540 |     1.292385 | 23.279645% |
    test_list_entries_subquery |     2.413966 |     0.979424 | 59.426772% |

Benchmark code can be viewed here. The benchmarks were run against Django 1.2.3 final. Both benchmarks used an on-disk SQLite database.

Aspects of the design

Cribbed from Django

Off in left-field

Examples of some queries

# a simple query selecting a user
User.get(username='charles')

# get the staff and super users
editors = User.select().where(Q(is_staff=True) | Q(is_superuser=True))

# get tweets by editors using a subquery
Tweet.select().where(user__in=editors)

# get tweets by editors using a join
Tweet.select().join(User).where(Q(is_staff=True) | Q(is_superuser=True))

# how many active users are there?
User.select().where(active=True).count()

# paginate the user table and show me page 3 (users 41-60)
User.select().order_by(('username', 'asc')).paginate(3, 20)

# order users by number of tweets
User.select({
    User: ['*'],
    Tweet: [Count('id', 'num_tweets')]
}).group_by('id').join(Tweet).order_by(('num_tweets', 'desc'))

Example App

tweepee

I originally wrote peewee to provide some lightweight persistence for flask apps. As an example app, I've written a (cheesy) twitter-alike -- the entire app is under 250 lines of code and exemplifies quite a few of peewee's features.

For instructions on running the example app yourself, or for an in-depth walkthrough, check the example app docs.

Documentation

If you're interested in readming more, please check out the documentation. The docs are currently broken up into 3 main sections:

Conclusion

I've had a great time working on this code and plan on continuing to develop on it. Please feel free to contribute if you're interested! I'll close with a quote from the Art of Unix Programming:

"Software design and implementation should be a joyous art, a kind of high-level play."

EDIT:

Based on reader interest, I put together a set of benchmarks for SQLAlchemy's ORM. The only benchmark I had trouble replicating was the test_list_entries_subquery one, but the other ones should be right. Any SQLAlchemy users out there, I'd appreciate if you could sanity check my benchmarking code! SQLAlchemy outperformed Peewee when listing the blogs for a user (selecting on the back side of a ForeignKey), but performance was quite a bit slower in all other tests:

.                          |django_bench |peewee_bench |sqlalc_bench | djang diff | sqlal diff |
             test_creation |   10.585136 |    9.183795 |   16.681107 | 13.238762% | 44.944931% |
       test_get_user_count |    0.040043 |    0.019856 |    0.085400 | 50.413509% | 76.749471% |
           test_list_users |    2.769415 |    2.124650 |    3.540615 | 23.281629% | 39.992065% |
   test_list_users_ordered |    3.154166 |    2.533434 |    3.901886 | 19.679759% | 35.071555% |
             test_get_user |    0.236253 |    0.053331 |    0.221075 | 77.426364% | 75.876565% |
        test_get_or_create |    0.644203 |    0.434922 |    0.618659 | 32.486832% | 29.699242% |
  test_list_blogs_for_user |    1.683750 |    0.684124 |    0.661601 | 59.369032% | -3.404306% |
test_list_entries_for_user |    1.866349 |    1.330273 |    2.662108 | 28.723246% | 50.029331% |
test_list_entries_subquery |    2.621584 |    1.029036 |    0.002033 | 60.747545% | N/A        |

Comments (15)

Charles | dec 10 2010, at 11:12am

I'm glad you dug my project! I have to disagree with you, though, that sql-specific abstractions are a thing of the past. I think you're talking about the limitations of any data store abstraction. To write to a certain backend (or type of backend, in this case SQL databases) you take for granted that certain features will be present - when building for many backends, each of which is implementing a different feature-set, you end up with a lowest common denominator solution, which was definitely not the purpose of this project. I would disagree with you that "existing ORM layers are so limited", and I think you can only take a generalization so far. The concept of JOINing is innate to relational dbs but is generally a hack w/NoSQL or K/V stores.

data abstraction must be more general | dec 10 2010, at 07:03am

This looks very interesting - but is it tied to SQL? We can learn from recent developments that the existing ORM layers are so limited because of the strong binding to SQL - any new crud layer should bind to an abstraction of (any) DATASTORE not only to SQL databases, don´t you think? The abstraction of SQLonly gives you not enough freedom of choice for an appropriate flexibility and growth of your web application - rewriting your app if you change your datastore is exactly what an abstraction layer should help to avoid! Would be nice to see a more general solution to this and not hacking around sql-specific limits :) Please don´t get me wrong - I appreciate your project and your efforts are most valuable and I have biggest respect - but I think an sql only approach is an unnecessary limitation for webapps nowadays. yes, I am expecting a "not a usecase"-answer, but nevertheless I think the times are gone for "sqlonly" abstraction layers...

Peter Bengtsson | dec 01 2010, at 04:18am

I'm VERY excited about this. I've got an old Zope app that has all SQL handcoded as templates. It's really tight and very fast and was created before Django even existed. Peewee is going to be great for trivial chunks of SQL like updating a user account's birthday or something.

Keep up the good work!

Charles | nov 30 2010, at 10:15am

Most of the performance increase is happening "in python" - as Mike pointed out in the comments on HN, peewee has a much shallower call stack than either of the other ORMs. Django also does a good deal of object cloning/deepcopy-ing which may be coming into play as well. I haven't profiled the code and would like to at some point but if you want to give it a shot you might try http://www.vrplumber.com/programming/runsnakerun/

Nip | nov 29 2010, at 06:26pm

Care to elaborate on how you achieved the speed improvements over Django? Maybe the Django project can benefit from your work as well...

Peter Baumgartner | nov 29 2010, at 01:36pm

Hey Charles, probably a little late, but you may find this interesting, http://autumn-orm.org/

mike bayer | nov 28 2010, at 11:25pm

That is true, benchmarks that just rank the Python overhead of individual SQL statement construction and executions will tend to not favor SQLAlchemy very well, as it takes a lot of decisionmaking within those processes to meet SQLA's behavioral contract and we only have a very small amount of code written in C as of yet. Fortunately being able to cut down on overall number of statements required is a great way to make an end-run around the thicker execution callstack.

Charles | nov 28 2010, at 11:07pm

Thanks shazow & Mike - this was my first time using SQLAlchemy. I am not familiar with the internals but I figured the atoms of the benchmarks were all so simple that the results would be somewhat representative. The flawed assumption there, as Mike pointed out, is that its racking up time due to expired data or overly-simplistic queries.

That being said, expiring the session results is, more-or-less, what I intend to do. If you peek at the django or peewee benchmarks, the results of queries are being coerced to lists, which forces the queries to be executed and evaluated.

Lastly, good call on the missing indexes, Shazow. Django creates indexes on the foreign keys which are not present in the SQLAlchemy or Peewee schemas. I'll look into getting some fresh data up tomorrow. Thanks for reading (and keepin' me honest!)

mike bayer | nov 28 2010, at 10:47pm

All of those commit() calls in the SQLA test incur the overhead of a COMMIT, which the Peewee test is not doing. This is the same oversight I wrote about years ago in http://techspot.zzzeek.org/2007/12/26/revisiting-storm-sqlalchemy-and-geniusql/. SQLite has very low overhead for commit, unlike PG or MySQL, but, modern versions of SQLAlchemy also expire the contents of the session after a commit, to work properly with transaction isolation. So changing the commit() calls to flush() speeds up the "blogs for user" test by 70% on my end.

Additionally, the session is autoflushing so flush() is not needed. So removing all commit/flush altogether speeds up "blog for user" by another 50% on my end.

I'd also comment that what kills ORM performance in the real world is number of database round trips. Being able to store already-loaded collections instead of loading on every hit, as well as the ability to load all the related collections for a full set of objects in just zero or a few queries are the major features SQLAlchemy provides, which isn't illustrated by these benchmarks (nor does anyone else's since nobody else has that feature).

shazow | nov 28 2010, at 10:37pm

Nice. Couple of suggestions:

Don't re-use the same sqlite file for all the benchmarks, the order of the tests may affect the performance in that case.

Try running in-memory instead, to reduce effects of the hard drive ("sqlite:///:memory:").

Make sure the schemas are identical. The schema you used for the SQLA example is definitely less efficient than the other examples, such as lack of indices on ForeignKey columns (which I think Django puts in automatically).

Here's my quick in-progress take on it before I had to run off and noticed you already did the benchmark by the time I came back:

https://gist.github.com/719588

Zach | nov 28 2010, at 09:06pm

Would you mind throwing up some comparison notes between peewee and storm?

Oliver Beattie | nov 28 2010, at 08:31pm

This is awesome, but I wish you'd gone with SA-style query building using rich comparison methods instead of using the Django style column__operator style. But like I said, this is awesome :)

memo | nov 28 2010, at 08:22pm

Yup, i would love to see the performance of SQLAlchemy!

Charlie Leifer | nov 28 2010, at 07:54pm

Totally - I'll see what I can whip up. Thanks for reading!

shazow | nov 28 2010, at 07:30pm

You should include SQLAlchemy in your benchmarks. :)


Commenting has been closed.