Python Proposal: Accumulation Displays

(Status: this proposal will not go forward, but helped spark a discussion on a better idea--generator expressions.)

List displays (aka list comprehensions) have proven to be a useful and popular part of Python. However, there are limitations on what they can do, and they sometimes lead to inefficiencies: it is tempting to create a list of intermediate results, even when the whole list is not really needed. Consider finding the maximum termperature over the last day, where you have a function, temp, that takes an hour as argument and fetches the temperature. With list comprehensions this is:

    max([temp(hour) for hour in range(24)])

This builds up the list of temperatures, only to discard them. Not so bad in this case, but troublesome in cases when the list is multiple megabytes. In this note I propose a new syntax called accumulation displays that combines the accumulating operator (in this case max) with the iteration over a collection of values. Accumulation displays are intended to be more efficient in some cases (because no intermediate list is constructed), more extensible (you can do things beyond max, min, and sum that you just can't do by operating on a returned list), and very natural to read. An accumulation display for the example above is:

    [Max: temp(hour) for hour in range(24)]

I use "Max" rather than "max" because I want the accumulator to be different from the builtin function (they could be made the same, but I prefer not to for reasons we will soon see). Here are some more examples:

    [Sum: x*x for x in numbers]
    [Product: Prob_spam(word) for word in email_msg]
    [Min: temp(hour) for hour in range(24)]
    [Mean: f(x) for x in data]
    [Median: f(x) for x in data]
    [Mode: f(x) for x in data]
    [SortBy: abs(x) for x in (-2, -4, 3, 1)]

Syntax

The syntax for accumulation displays is:

    "[" expression ":" listmaker "]"

where listmaker is everything that can go into a list display. (See http://www.python.org/doc/current/ref/lists.html.)

Semantics

In this section I will derive the semantics of accumulation displays, and in the process show how to define the accumulators shown above (Max, Sum, etc.) as well as new ones. Let's consider the example

    [Sum: x*x for x in numbers]

we want that to be the equivalent of

    total = 0
    for x in it:
        total = (x*x + total)
    return total

That seems simple enough: we initialize a value, update it in the loop, and then return it at the end. But it is not always quite that easy. For example, in

    [Mean: f(x) for x in data]

we want

    (total, n) = (0, 0)
    for x in it:
        (total, n) = (total+x, n+1)
    return total / n

In this case we need some additional computation at the end, and the value we are updating is more complex (a two-element tuple rather than a single number). (And we have assumed true division.) We need to make some choices: let us say that an accumulator (such as Sum or Mean) is a class or other callable object which when called with no arguments returns an object with proper initial state, and that we then call the object's add method to accumulate values in the loop, and its result method to get the final result. Rather than showing the expansion of the accumulation display code into a sequence of statements, we will show it as a function call. That is,

    [acc: exp for x in it]

is equivalent to

    accumulation(acc, lambda x: exp, it)

where we have defined

    def accumulation(acc, f, it):
        a = acc()
        for x in it:
            a.add(f(x))
        return a.result()

We can now define Mean as

    class Mean:
        def __init__(self):
            self.total, self.n = 0, 0

        def add(self, exp):
            self.total, self.n = self.total+exp, self.n+1

        def result(self):
            return self.total / self.n

Variable/Value Accumulations

Sometimes you want the add method to get at both exp and x. For example, we can determine the winner of an election with

    [Argmax: votes(c) for c in candidates]

(In math "argmax" (or "arg max") gives you the element of a set with the maximal value on some function (whereas "max" gives you the maximal function value itself).) Here we are asking for the element c with the maximal value of votes(c). We can program this by adding the three characters ", x" to the definition of accumulation:

    def accumulation(acc, f, it):
        a = acc()
        for x in it:
            a.add(f(x), x)
        return a.result()

Short-Circuit Accumulations

The next step is to be able to break out of the loop when some condition is met. Consider:

    [Some: temp(hour) > 30 for hour in range(24)]

We want this to return a true value if there is some hour with temp(hour) > 30, and False if there is not. Obviously it can stop iterating when it finds the first element that satisfies the condition. We will handle this by having the accumulator's add method return a true value if it wants to stop the loop. After changing accumulation

    def accumulation(acc, f, it):
        a = acc()
        for x in it:
            if a.add(f(x), x):
                break
        return a.result()

we can define Some as:

    class Some:
        def __init__(self):
            self.value = False

        def add(self, exp, _):
            if exp:
                self.value = True
                return StopIteration

        def result(self):
            return self.value

Note I returned StopIteration (rather than, say, True) just as a hint to the reader. (I considered having the protocol raise StopIteration rather than just returning it, but I think it is simpler just to return a value.) Note that we still call the result method, even when we have broken out of the loop. If we had the add method return the final result for the accumulation it would work and be slightly simpler for Some, but wouldn't work on an accumulator that needs to exit early and then return None as its final result.

Parameterized Accumulations

Consider the problem of finding the top 10 funniest jokes for David Letterman:

    [Top(10): humor(joke) for joke in jokes]

Up to now, accumulators were passed no arguments when they were initialized, implicitly, by the function accumulation. Here we have an accumulator, Top, that takes a parameter, and all we have to do is change accumulation so that it can accept either an accumulator class or an instance:

    def accumulation(acc, f, it):
        if isinstance(acc, (types.ClassType, types.TypeType)): 
            a = acc()                                          
        else:                                                  
            a = acc
        for x in it:
            if a.add(f(x), x):
                break         
        return a.result()

Now we can define Top:

    from heapq import *

    class Top:
        def __init__(self, k):
            self.k = k
            self.heap = []

        def add(self, exp, _):
            heap = self.heap
            if len(heap) < self.k:
                heappush(heap, exp)
            elif exp < heap.biggest():
                heapreplace(heap, exp)

        def result(self):
            return self.heap

Note that this can be much more efficient than alternatives such as:

    jokes.sort(lambda a, b: humor(a) < humor(b))
    jokes[0:10]

Non-Display Accumulations

One advantage of accumulation displays is that they make it possible to invest some time in an efficient implementation of an algorithm, such as finding the top n elements of a stream of values using a priority queue. But sometimes you may want to use such an algorithm with Python statements, not within an accumulation display. If you've used list displays, you know how this happens: you start with a simple list display, but then find you want an "if" or a "print". So you convert the display to a for loop. It would be a shame if you had a nice algorithm written as an accumulator, but couldn't use when you end up converting to a loop. Fortunately, this is not a problem; you can use accumulators with regular Python statements, or even mix statements and accumulation displays. Consider:

    a = Top(10)
    for j in new_jokes:
      print j
      a.add(float(raw_input("Rating?")), j)
    return [a: humor(j) for j in machine_ratable_jokes]

Here we need a print statement, so we explicitly manage the accumulator a, filling it with results derived by asking the user for a rating, and then we use the same accumulator in a display to add in some more results and return the final list of top 10 results.

Discussion

List displays are powerful and widely-used. This proposal extends the range of places where displays can be used. One way to look at it is that list displays implement map and filter but not reduce nor other common higher-order functions. This proposal implements many of the higher-order functions (and allows you to implement others) and does so in a way that you never need a lambda -- you write expressions rather than functions.

Another way to look at it is that there are some common algorithms in programming, such as finding the best element(s) or value(s) from a candidate set, that appear again and again, and this allows us to abstract out these algorithms, rather than re-implement them every time.

One drawback of my proposal is that the [ ... ] syntax suggests that a list is returned, when that need not be the case. My defense is that it is iterating over a list (or at least, an iterable). But Guido pronounced, quite reasonably, that this was not a sufficient defense.

There have been past discussions of set displays and dict displays, using syntax something like:

    words = { w.lower() for w in text.split() }
    squares = { i:i*i for i in range(N) }

I'm not going to take a stand on whether these are good ideas, but I note you can do this with accumulation displays easily:

    words = [Set: w.lower() for w in text.split()]
    squares = [Dict: i, i*i for i in range(N)]

(or of course you can do it without any changes to the language):

    words = set([w.lower() for w in text.split()])
    squares = dict([i, i*i for i in range(N)])

Discussion of my proposal led to the ressurection of PEP 289 for generator comprehensions, slightly updated and now called generator expressions. The idea is that

    exp for x in it

would be roughly equivalent to

    def gen():
        for x in it: 
            yield exp
    return gen()

You get short-circuit evaluation just by ceasing to call the generator. In fact, all the accumulators that I wrote could be re-written as functions that take an generator/iterator/sequence as input (although if you want both the loop variable and the computed value, you need to pass them both as a tuple). Compare:

Accumulation Display Generator Expression plus Function

[Sum: x*x for x in numbers] [Product: Prob_spam(word) for word in email_msg] [Min: temp(hour) for hour in range(24)] [Median: f(x) for x in data] [SortBy: abs(x) for x in (-2, -4, 3, 1)] [Top(10): humor(joke) for joke in jokes] [Argmax: votes(c) for c in candidates]

sum(x*x for x in numbers) product(Prob_spam(word) for word in email_msg) min(temp(hour) for hour in range(24)) median(f(x) for x in data) sortdecorated((abs(x),x) for x in (-2, -4, 3, 1)) top(10, (humor(joke) for joke in jokes)) argmax((c,votes(c)) for c in candidates)

Accumulation Display	Generator Expression plus Function
[Sum: x*x for x in numbers] [Product: Prob_spam(word) for word in email_msg] [Min: temp(hour) for hour in range(24)] [Median: f(x) for x in data] [SortBy: abs(x) for x in (-2, -4, 3, 1)] [Top(10): humor(joke) for joke in jokes] [Argmax: votes(c) for c in candidates]	sum(x*x for x in numbers) product(Prob_spam(word) for word in email_msg) min(temp(hour) for hour in range(24)) median(f(x) for x in data) sortdecorated((abs(x),x) for x in (-2, -4, 3, 1)) top(10, (humor(joke) for joke in jokes)) argmax((c,votes(c)) for c in candidates)

With the accumulation display approach, you need a change to the language, and a new set of accumulators. With the generator expression approach you need a change to the language of similar complexity, and a combination of old funcxtions (sum, min) and new functions (product, top, argmax). The new functions are somewhat easier to write than the accumulators; for example a 6-line some function instead of an 8-line Some accumulator:

    def some(items):
        "If some element in items is true, return it; otherwise return False"
        for item in items:
            if item:
                return item
        return False

Philip J. Eby and Raymond Hettinger proposed the generator expression idea. I'm happy that my proposal came at the right time to help spur a discussion that appears to be on the way to a happy ending--I think the generator expression idea is more generally useful than my proposal.

Other points were raised in the discussion on python-dev. Ian Bicking made a suggestion that I interpreted to mean that accumulation should be implemented as something like:

    def accumulation(acc, f, it):
        a = acc.__accum__()
        for x in it:
            if a.add(f(x), x):
                break         
        return a.result()

The advantage is that min, max, sum and perhaps some others could be given a __accum__ method. I'm not sure of the details of how this would work out.

Next Steps

It appears that accumulation displays will now retire; I won't bother writing up a PEP, but instead will root for PEP 289. For historical purposes you can see

A library of accumulators. I'm thinking there should be an accum module, which users can choose to import or import * from. I have a candidate accum module.

A demonstration of the use of these accumulators. I have one called testaccum which rewrites a limited class of accumulation displays into calls to accumulation and prints the results. It produces this output:

temp =  [70, 70, 71, 74, 76, 76, 72, 76, 77, 77, 77, 78, 78, 79, 79, 79, 78, 80, 82, 83, 83, 81, 84, 83]
data = temp
votes =  {'Gray': 45, 'Arnie': 48, 'Peter': 3, 'Cruz': 32, 'Tom': 13}
candidates =  ['Gray', 'Arnie', 'Peter', 'Cruz', 'Tom']

[Max: temp[hour] for hour in range(24)]
    ==>  84
[Min: temp[hour] for hour in range(24)]
    ==>  70
[Sum: x*x for x in data]
    ==>  144999
[Mean: f(x) for x in data]
    ==>  155.25
[Median: f(x) for x in data]
    ==>  156.0
[Mode: f(x) for x in data]
    ==>  166
[Argmax: votes[c] for c in candidates]
    ==>  Arnie
[Argmin: votes[c] for c in candidates]
    ==>  Peter
[Some: temp[hour] > 75 for hour in range(24)]
    ==>  True
[Every: temp[hour] > 75 for hour in range(24)]
    ==>  False
[Top(10): temp[hour] for hour in range(24)]
    ==>  [84, 83, 83, 83, 82, 81, 80, 79, 79, 79]
[Join(', '): votes[c] for c in candidates]
    ==>  45, 48, 3, 32, 13
[SortBy: abs(x) for x in (-2,-4, 3, 1)]
    ==>  [1, -2, 3, -4]
[SortBy(reverse=True): abs(x) for x in (-2, -4, 3, 1)]
    ==>  [-4, 3, -2, 1]

Changing the grammar would have been rather easy. Where the grammar now has:

listmaker  	 ::=   	expression ( list_for  | ( "," expression )* [","] )

we could change it to:

listmaker  	 ::=   	expression [ ":" expression ] ( list_for  | ( "," expression )* [","] )

Changing the code generator would have been simple for the simple examples shown here, but would require some thought for complex listmakers with multiple for and if clauses. Basically, we want

    [acc: exp for x in it1 if c1 for y in it2 if c2 ...]

to mean

    from itertools import ifilter

    accumulation(acc, lambda (x, y, ...): exp, 
                 icross_product(ifilter(c1, it1), ifilter(c2, it2), ...))

where icross_product takes the cross product of all the argument iterables, in left-to-right order. That is, it generates tuples in the order

(c1[0], c2[0], ... cn[0])
(c1[0], c2[0], ... cn[1])
(c1[0], c2[0], ... cn[2])
...
(c1[0], c2[0], ... cn[-1])
(c1[0], c2[0], ... cn_1[1], cn[0])
(c1[0], c2[0], ... cn_1[1], cn[1])
...
(c1[-1], c2[-1], ... cn[-1])

Perhaps it should be a standard part of itertools.

This document is in the public domain and may be used for any purpose.

Peter Norvig