Playing with Python Magic Methods to make a nicer Regex API

A co-worker of mine mentioned that he missed Ruby's syntactic sugar for regular expressions. I haven't used Ruby's regular expressions, but I'm familiar enough with Python's to know that the API is a bit wanting in syntactic sweetness.

First, retrieving capture groups from a regular expression takes three steps. You call either match() or search() and assign the result to a variable. Then you check that the result is not None (None means no match was found). Finally, if a match exists, you can safely extract the captured groups. Here is an example:

>>> import re
>>> match_obj = re.match('([0-9]+)', '123foo')
>>> match_obj  # What is `match_obj`?
<_sre.SRE_Match object at 0x7fd1bb000828>
>>> match_obj.groups()
('123',)

>>> match_obj = re.match('([0-9]+)', 'abc')
>>> print(match_obj)  # No match: the result is None
None

It would be nicer, in my opinion, to have something like:

>>> re.get_matches('([0-9]+)', '123foo')
('123',)

>>> re.get_matches('([0-9]+)', 'abc')
None
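
To be clear, the re module has no get_matches function; the name is wishful thinking on my part. A minimal sketch of such a helper as a standalone function, assuming re.match semantics (the pattern is anchored at the start of the string), might look like this:

```python
import re

def get_matches(pattern, string):
    """Hypothetical helper (not part of `re`): captured groups, or None."""
    match_obj = re.match(pattern, string)
    if match_obj is None:
        return None
    return match_obj.groups()

print(get_matches('([0-9]+)', '123foo'))  # ('123',)
print(get_matches('([0-9]+)', 'abc'))     # None
```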

The other thing I frequently run into is mixing up the parameters for re.sub, which performs find-and-replace. The required parameters, in order, are pattern, replacement, search_string. For whatever reason, it seems more intuitive to me to have search_string come before replacement.

Unfortunately, mangling these parameters can lead to "correct-looking" results. Here is an example. The goal here will be to replace the word foo with the word bar.

>>> re.sub('foo', 'replace foo with bar', 'bar')
'bar'

>>> re.sub('foo', 'bar', 'replace foo with bar')
'replace bar with bar'

In the first example the arguments are in the wrong order, yet the result looks plausible: we might presume that the input string was just "foo".
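
One way to sidestep the ordering problem without any new API is to pass the arguments by keyword. re.sub names its parameters pattern, repl and string, so a call written this way documents itself. A small sketch:

```python
import re

# Keyword arguments make the role of each string explicit,
# so the order in which they are written no longer matters.
result = re.sub(pattern='foo', repl='bar', string='replace foo with bar')
print(result)  # replace bar with bar

# Same call with the arguments shuffled: same result.
swapped = re.sub(string='replace foo with bar', repl='bar', pattern='foo')
assert swapped == result
```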

Sugar

For fun, I put together a little helper class that adds some syntactic sweetness to Python's regular expression library. I don't really suggest that anyone should use this, but it was fun to make and maybe it will give you some ideas on how you might improve the syntax of other libraries.

Before I show you the implementation, here are some examples of the API I devised.

Searching for matches is a single-step operation:

>>> def has_lower(s):
...     return bool(R/'[a-z]+'/s)

>>> has_lower('This contains lower-case')
True
>>> has_lower('NO LOWER-CASE HERE!')
False

Retrieving capture-groups is also easy:

>>> list(R/'([0-9]+)'/'extract 12 the 456 numbers')
['12', '456']

Finally, you can use the division operator one more time to perform replacements:

>>> R/'(foo|bar)'/'replace foo and bar'/'Huey!'
'replace Huey! and Huey!'

What do you think? More fun?

Implementation

The implementation is pretty straightforward and relies on Python's magic methods to provide the API. If there's a neat trick, it is the use of metaclasses to implement what is essentially a classmethod operator overload.

import re

class _R(type):
    def __div__(self, regex):
        return R(regex)

class R(object):
    __metaclass__ = _R

    def __init__(self, regex):
        self._regex = re.compile(regex)

    def __div__(self, s):
        return RegexOperation(self._regex, s)

class RegexOperation(object):
    def __init__(self, regex, search):
        self._regex = regex
        self._search = search

    def search(self):
        match = self._regex.search(self._search)
        if match is not None:
            return match.groups()

    def __nonzero__(self):
        # Python 2 truth-value hook: True when the pattern matches anywhere.
        return self._regex.search(self._search) is not None

    def __div__(self, replacement):
        return self._regex.sub(replacement, self._search)

    def __iter__(self):
        return iter(self._regex.findall(self._search))
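
As an aside on why a metaclass is needed at all: Python looks up operator methods on the type of the left operand, and the type of a class is its metaclass, so a classmethod defined on the class itself is never consulted for R / 'foo'. The sketch below shows the difference. It uses Python 3 spelling, where the / hook is __truediv__ rather than __div__, and the class names are made up for illustration.

```python
class Plain(object):
    # A classmethod operator: only consulted for *instances* of Plain.
    @classmethod
    def __truediv__(cls, other):
        return (cls.__name__, other)

class SlashMeta(type):
    # Defined on the metaclass, so it applies to the class object itself.
    def __truediv__(cls, other):
        return (cls.__name__, other)

class WithMeta(metaclass=SlashMeta):
    pass

try:
    Plain / 'x'
    plain_works = True
except TypeError:
    # type(Plain) is `type`, which defines no __truediv__.
    plain_works = False

print(plain_works)     # False
print(WithMeta / 'x')  # ('WithMeta', 'x')
```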

Stepping through the operations one by one should clarify what is going on behind the scenes.

Calling R / <something> will invoke the __div__ method on the _R class, which is basically a factory method for creating R instances:

>>> R/'foo'
<rx.R at 0x7f77c00831d0>

Then, invoking __div__ on the newly-created R object, we get a RegexOperation instance, so R.__div__ is another factory method.

>>> r_obj = R/'foo'
>>> r_obj / 'bar'
<rx.RegexOperation at 0x7f77c00837d0>

The final object, RegexOperation, implements several magic methods which allow us to retrieve matches, perform substitutions, and test for the existence of a match.
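
The listing above is Python 2 (it relies on __div__ and the __metaclass__ attribute). For anyone on Python 3, here is a rough port, offered as a sketch rather than a drop-in replacement. The changes are my own substitutions: the operator hook becomes __truediv__, the metaclass is passed as a class keyword, and the truth test uses __bool__. Everything else follows the original.

```python
import re

class _R(type):
    def __truediv__(cls, regex):
        # R / 'pattern' builds an R instance.
        return R(regex)

class R(metaclass=_R):
    def __init__(self, regex):
        self._regex = re.compile(regex)

    def __truediv__(self, s):
        # r_obj / 'text' pairs the compiled pattern with a search string.
        return RegexOperation(self._regex, s)

class RegexOperation(object):
    def __init__(self, regex, search):
        self._regex = regex
        self._search = search

    def __bool__(self):
        # Truthy when the pattern matches anywhere in the string.
        return self._regex.search(self._search) is not None

    def __truediv__(self, replacement):
        # operation / 'replacement' performs the substitution.
        return self._regex.sub(replacement, self._search)

    def __iter__(self):
        return iter(self._regex.findall(self._search))

print(bool(R/'[a-z]+'/'NO LOWER-CASE HERE!'))           # False
print(list(R/'([0-9]+)'/'extract 12 the 456 numbers'))  # ['12', '456']
print(R/'(foo|bar)'/'replace foo and bar'/'Huey!')      # replace Huey! and Huey!
```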

Thanks for reading

Thanks for taking the time to read this post. I hope you found it interesting! Feel free to leave a comment below.

Comments (5)

Jonathan Hartley | jul 24 2014, at 12:50pm

Really great ideas, thanks for writing it up.

One minor point: if we're improving the API to re, does it make more sense to return an empty collection when there are no matches, rather than None? That way, people can simply write 'for match in MATCH_EXPR' without having to check for None every time.

Abhas | jul 24 2014, at 12:27pm

Just wanted to say that I really loved this lib called parsely for writing complicated regexps. Surely that's not the problem you want to address, but I was really impressed by it.

Tomasz Wyderka | jul 21 2014, at 01:52am

I'm impressed, Charles! Excellent operator overloading. Like in good C++ code. Actually, I don't see much Python code with operator overloading. Pure pleasure to try your code!

Of course, not everything can be done in Python. Passing flags to compile() will require a lot of extra work. As John noticed, findall() solves some of your problems. I think most misunderstandings about the re module stem from its overly compressed documentation. So I wrote this "advanced" Python RE tutorial; maybe it will be useful for you too: http://www.cofoh.com/advanced-regex-tutorial-python/traps

John Strickler | jul 19 2014, at 11:16pm

Sorry about the above mess -- I need to learn to work the comment feature better.

I wanted to add that re.findall() returns an empty list when there are no matches.

Also, re.finditer() provides an iterator over each match object, if you need that.

John Strickler | jul 19 2014, at 11:13pm

+1 for cool use of operator overloading, -1 for reinventing the wheel

Your code:

>>> list(R/'([0-9]+)'/'extract 12 the 456 numbers')
['12', '456']

Is actually more complicated IMO than native Python:

>>> re.findall('([0-9]+)', 'extract 12 the 456 numbers')
['12', '456']

If there are no explicit groups, re.findall() will return a list consisting of group 0 for all matches; if exactly one group is defined, it returns a list of that group's matches, as above; if more than one group is defined, re.findall() returns a list of tuples containing the matched data for each group:

>>> re.findall('([A-Z])([0-9]+)', 'A133 B873 xxx yyy C946')
[('A', '133'), ('B', '873'), ('C', '946')]

Very interesting post in any case. Thanks!

I agree with you on the order of arguments to re.sub.

I disagree about the

