Micawber, a Python library for extracting rich content from URLs

April 19, 2012 11:13 / django micawber oembed / 0 comments

OEmbed is a simple, open API standard for embedding rich content and retrieving content metadata. The way OEmbed works is actually kind of ingenious, because the only things a consumer of the API needs to know are the location of the OEmbed endpoint, and the URL to the piece of content they want to embed.

YouTube, for example, maintains an OEmbed endpoint at youtube.com/oembed. Using the OEmbed endpoint, we can very easily retrieve the HTML for an embedded video player along with metadata about the clip:

GET https://www.youtube.com/oembed?url=https://www.youtube.com/watch?v=nda_OSWeyn8

Response:

{
  "provider_url": "https://www.youtube.com/", 
  "title": "Leprechaun in Mobile, Alabama", 
  "type": "video", 
  "html": "<iframe width=\"459\" height=\"344\" src=\"https://www.youtube.com/embed/nda_OSWeyn8?feature=oembed\" frameborder=\"0\" allowfullscreen></iframe>", 
  "thumbnail_width": 480, 
  "height": 344, 
  "width": 459, 
  "version": "1.0", 
  "author_name": "botmib", 
  "thumbnail_height": 360, 
  "thumbnail_url": "https://i.ytimg.com/vi/nda_OSWeyn8/hqdefault.jpg", 
  "provider_name": "YouTube", 
  "author_url": "https://www.youtube.com/user/botmib"
}
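To show how little a consumer needs, here is a minimal sketch using only the Python 3 standard library. The helper names `build_oembed_url` and `fetch_oembed` are made up for this example and are not part of micawber. The `format` argument is the optional response-format parameter described in the oembed spec, and I am assuming the endpoint honours it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_oembed_url(endpoint, content_url, **params):
    """Build the GET URL for an oembed request (hypothetical helper)."""
    query = {'url': content_url, 'format': 'json'}
    query.update(params)  # e.g. maxwidth / maxheight from the spec
    return endpoint + '?' + urlencode(query)


def fetch_oembed(endpoint, content_url, **params):
    """Request metadata from the endpoint and decode the JSON body."""
    with urlopen(build_oembed_url(endpoint, content_url, **params)) as resp:
        return json.loads(resp.read().decode('utf-8'))


# Only the URL is built here; calling fetch_oembed() needs network access.
request_url = build_oembed_url(
    'https://www.youtube.com/oembed',
    'https://www.youtube.com/watch?v=nda_OSWeyn8')
print(request_url)
```

Note that the content URL gets percent-encoded when it is placed in the query string, which is what a well-behaved consumer should send.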

The oembed spec defines four types of content (photo, video, link and rich) along with a number of required attributes for each content type. This makes it a snap for consumers to use a single interface for handling things like:

youtube videos
flickr photos
hulu videos
slideshare decks
and many more
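Because the required attributes are fixed per type, a consumer can sanity-check any response the same way. The sketch below is my own illustration, not something micawber exposes: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, and the field lists are my reading of the oembed 1.0 spec.

```python
# Required response parameters per content type, as I read the oembed
# 1.0 spec. Every response must also carry "type" and "version".
REQUIRED_FIELDS = {
    'photo': ('url', 'width', 'height'),
    'video': ('html', 'width', 'height'),
    'rich': ('html', 'width', 'height'),
    'link': (),
}


def missing_fields(response):
    """Return the names of required fields absent from an oembed response."""
    content_type = response.get('type')
    if content_type not in REQUIRED_FIELDS:
        raise ValueError('unknown oembed type: %r' % (content_type,))
    required = ('type', 'version') + REQUIRED_FIELDS[content_type]
    return [name for name in required if name not in response]


# A trimmed-down video response, similar in shape to the YouTube one.
video = {'type': 'video', 'version': '1.0', 'html': '<iframe></iframe>',
         'width': 459, 'height': 344}
print(missing_fields(video))

# A photo response that forgot to include its image URL.
photo = {'type': 'photo', 'version': '1.0', 'width': 500, 'height': 375}
print(missing_fields(photo))
```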

A quick note on embed.ly

If you click the last link in that list it will send you to http://embed.ly/, a service launched a year or so ago that provides a single endpoint for all sorts of different content. Many big sites provide their own endpoints, however, so the decision to use a service like embedly really depends on your individual needs. I tried out their free tier and found it to be much slower than using the native endpoints provided by youtube and flickr, but the sheer number of sites they support makes them a pretty good option. Luckily, you don't have to decide right now: micawber supports both workflows.

Back to micawber

Micawber was designed for embedding rich content using the oembed API. In many ways it is a successor to an earlier project, djangoembed, which I have not been very good at maintaining. Instead of being limited to django, though, micawber can be used with any python project. It supports a low-level API capable of:

requesting rich metadata for a URL from a given endpoint
extracting metadata from a block of text or html
parsing a block of text or HTML and replacing URLs with rich content

If you're using Flask or Django, there is a higher-level API consisting of a couple of template filters which do the same things.

I put a demo up on appengine (hoping it doesn't break too badly, since this is my first appengine deploy). Try entering some URLs to things like youtube videos or flickr photos: http://micawberdemo.appspot.com/

Providers

Behind the scenes, your app creates a mapping from a partial URL regex to a particular endpoint, e.g.:

http://\S*.youtu(\.be|be\.com)/watch\S*  -->  http://www.youtube.com/oembed
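To make the idea concrete, here is a rough sketch of such a mapping in plain Python. It is not micawber's actual implementation: `PROVIDERS` and `endpoint_for` are names I made up, and the only real inputs are the regex and endpoint shown above.

```python
import re

# Hypothetical stand-in for a provider registry: (compiled regex, endpoint).
PROVIDERS = [
    (re.compile(r'http://\S*.youtu(\.be|be\.com)/watch\S*'),
     'http://www.youtube.com/oembed'),
]


def endpoint_for(url):
    """Return the endpoint whose pattern matches url, or None."""
    for pattern, endpoint in PROVIDERS:
        if pattern.match(url):
            return endpoint
    return None


print(endpoint_for('http://www.youtube.com/watch?v=nda_OSWeyn8'))
print(endpoint_for('http://www.google.com/'))
```

The first lookup finds the youtube endpoint, while the second returns None because no registered pattern matches.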

What happens when you ask the youtube oembed endpoint for metadata about a video simply by providing that video's URL?

curl 'http://www.youtube.com/oembed?url=http://www.youtube.com/watch?v=nda_OSWeyn8'

The endpoint responds with JSON which, decoded into a Python dictionary, looks like this:

{'author_name': u'botmib',
 'author_url': u'http://www.youtube.com/user/botmib',
 'height': 344,
 'html': u'<iframe width="459" height="344" src="http://www.youtube.com/embed/nda_OSWeyn8?fs=1&feature=oembed" frameborder="0" allowfullscreen></iframe>',
 'provider_name': u'YouTube',
 'provider_url': u'http://www.youtube.com/',
 'thumbnail_height': 360,
 'thumbnail_url': u'http://i3.ytimg.com/vi/nda_OSWeyn8/hqdefault.jpg',
 'thumbnail_width': 480,
 'title': u'Leprechaun in Mobile, Alabama',
 'type': u'video',
 'version': u'1.0',
 'width': 459}

Using these providers it is a snap to add nice thumbnail previews of content within blocks of text, or even to parse blocks of text or HTML and replace URLs with rich content (e.g. a URL becomes a flash player or an img tag).

For simplicity, micawber comes with two "bootstrap" functions to get you a prepopulated list of providers:

bootstrap_basic(), which loads up a list of providers with native endpoints
bootstrap_embedly(), which asks embedly for a list of providers and configures them

Interactive shell session

Below is an annotated interactive shell session showing how these components work.

Import micawber and load up a list of providers. It comes prepopulated with a handful of providers.

>>> import micawber
>>> providers = micawber.bootstrap_basic()
>>> providers
<micawber.providers.ProviderRegistry at 0x2681690>

>>> providers._registry
{'http://\\S*.youtu(\\.be|be\\.com)/watch\\S*': <micawber.providers.Provider at 0x2681d90>,
 'http://\\S*?flickr.com/\\S*': <micawber.providers.Provider at 0x2681d50>,
 'http://vimeo.com/\\S*': <micawber.providers.Provider at 0x2681e10>,
 'http://www.hulu.com/watch/\\S*': <micawber.providers.Provider at 0x2681dd0>,
 'http://www.slideshare.net/[^\\/]+/\\S*': <micawber.providers.Provider at 0x2681e50>}

Request some metadata about a URL we know about and a dictionary is returned. All metadata returned follows the oembed spec, which specifies various response parameters:

>>> providers.request('http://www.youtube.com/watch?v=nda_OSWeyn8')
{'author_name': u'botmib',
 'author_url': u'http://www.youtube.com/user/botmib',
 'height': 344,
 ...
}

URLs we do not have providers for will raise a ProviderException:

>>> providers.request('http://www.google.com/')
ProviderException: Provider not found for "http://www.google.com/"

There are higher-level functions which can parse text or HTML, either replacing the URLs with rich content or extracting the metadata and returning it in a dictionary. The extract functions return a 2-tuple: a list of all URLs in order of appearance, and a dictionary, keyed by URL, holding the metadata for each URL that matched a provider:

>>> micawber.extract("http://google.com/ and http://www.youtube.com/watch?v=nda_OSWeyn8", providers)
(['http://google.com/', 'http://www.youtube.com/watch?v=nda_OSWeyn8'],
 {'http://www.youtube.com/watch?v=nda_OSWeyn8': {
      'author_name': u'botmib',
      'author_url': u'http://www.youtube.com/user/botmib',
      'height': 344,
      ... etc ...
      }
 })
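The shape of that return value is easy to mimic. The following is only a sketch of the idea, not micawber's code: `extract_sketch`, `URL_RE` and `FAKE_DATA` are hypothetical names, the URL regex is deliberately simplistic, and canned data stands in for the network request.

```python
import re

URL_RE = re.compile(r'https?://\S+')  # simplistic URL pattern (assumption)


def extract_sketch(text, lookup):
    """Return (all_urls, {url: metadata}) for the URLs found in text.

    lookup is any callable that returns a metadata dict for a URL, or
    raises KeyError when no provider is known for it.
    """
    all_urls = []
    extracted = {}
    for url in URL_RE.findall(text):
        all_urls.append(url)
        try:
            extracted[url] = lookup(url)
        except KeyError:
            pass  # unknown provider: listed, but no metadata
    return all_urls, extracted


# Canned metadata standing in for a real oembed request.
FAKE_DATA = {
    'http://www.youtube.com/watch?v=nda_OSWeyn8': {'type': 'video'},
}
urls, data = extract_sketch(
    'http://google.com/ and http://www.youtube.com/watch?v=nda_OSWeyn8',
    FAKE_DATA.__getitem__)
print(urls)
print(data)
```

As in the session above, the google URL shows up in the list but not in the dictionary, since nothing knows how to fetch metadata for it.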

>>> print micawber.parse_text("this is a test\nhttp://www.youtube.com/watch?v=nda_OSWeyn8", providers)
this is a test
<iframe width="459" height="344"
  src="http://www.youtube.com/embed/nda_OSWeyn8?fs=1&feature=oembed"
  frameborder="0" allowfullscreen></iframe>
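Replacing URLs with rich content follows the same pattern. Here is a small sketch under the same assumptions as before: `replace_urls` and `FAKE_HTML` are hypothetical, the URL regex is simplistic, and a dictionary plays the role of the provider lookup.

```python
import re

URL_RE = re.compile(r'https?://\S+')  # simplistic URL pattern (assumption)


def replace_urls(text, lookup):
    """Swap each URL in text for the "html" value of its metadata.

    URLs that lookup cannot resolve (KeyError) are left untouched.
    """
    def render(match):
        url = match.group(0)
        try:
            return lookup(url)['html']
        except KeyError:
            return url
    return URL_RE.sub(render, text)


# Canned markup standing in for a real oembed response.
FAKE_HTML = {
    'http://www.youtube.com/watch?v=nda_OSWeyn8': {'html': '<iframe></iframe>'},
}
result = replace_urls(
    'this is a test\nhttp://www.youtube.com/watch?v=nda_OSWeyn8',
    FAKE_HTML.__getitem__)
print(result)
```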

Finally, if using Django or Flask there are template filters for doing the same:

>>> from django.template import Template, Context
>>> t = Template('{% load micawber_tags %}{{ "http://www.youtube.com/watch?v=nda_OSWeyn8"|oembed }}')
>>> t.render(Context())
<iframe width="459" height="344"
  src="http://www.youtube.com/embed/nda_OSWeyn8?fs=1&feature=oembed"
  frameborder="0" allowfullscreen></iframe>

Reading more

If you're interested in learning more about the project, check out the documentation, hosted on readthedocs. You can also browse the source code, hosted on GitHub. There's a live demo hosted on appengine: http://micawberdemo.appspot.com/

Hope you enjoyed reading about this project. I've had a lot of fun working on it. Please let me know if you have any questions or suggestions by leaving a comment or contacting me.

An example, for your viewing pleasure
