(Status: this proposal will not go forward, but helped spark a discussion on a better idea--generator expressions.)
List displays (aka list comprehensions) have proven to be a useful and popular part of Python. However, there are limitations on what they can do, and they sometimes lead to inefficiencies: it is tempting to create a list of intermediate results, even when the whole list is not really needed. Consider finding the maximum termperature over the last day, where you have a function, temp, that takes an hour as argument and fetches the temperature. With list comprehensions this is:
max([temp(hour) for hour in range(24)])This builds up the list of temperatures, only to discard them. Not so bad in this case, but troublesome in cases when the list is multiple megabytes. In this note I propose a new syntax called accumulation displays that combines the accumulating operator (in this case max) with the iteration over a collection of values. Accumulation displays are intended to be more efficient in some cases (because no intermediate list is constructed), more extensible (you can do things beyond max, min, and sum that you just can't do by operating on a returned list), and very natural to read. An accumulation display for the example above is:
[Max: temp(hour) for hour in range(24)]I use "Max" rather than "max" because I want the accumulator to be different from the builtin function (they could be made the same, but I prefer not to for reasons we will soon see). Here are some more examples:
[Sum: x*x for x in numbers] [Product: Prob_spam(word) for word in email_msg] [Min: temp(hour) for hour in range(24)] [Mean: f(x) for x in data] [Median: f(x) for x in data] [Mode: f(x) for x in data] [SortBy: abs(x) for x in (-2, -4, 3, 1)]
"[" expression ":" listmaker "]"where listmaker is everything that can go into a list display. (See http://www.python.org/doc/current/ref/lists.html.)
[Sum: x*x for x in numbers]we want that to be the equivalent of
total = 0 for x in it: total = (x*x + total) return totalThat seems simple enough: we initialize a value, update it in the loop, and then return it at the end. But it is not always quite that easy. For example, in
[Mean: f(x) for x in data]we want
(total, n) = (0, 0) for x in it: (total, n) = (total+x, n+1) return total / nIn this case we need some additional computation at the end, and the value we are updating is more complex (a two-element tuple rather than a single number). (And we have assumed true division.) We need to make some choices: let us say that an accumulator (such as Sum or Mean) is a class or other callable object which when called with no arguments returns an object with proper initial state, and that we then call the object's add method to accumulate values in the loop, and its result method to get the final result. Rather than showing the expansion of the accumulation display code into a sequence of statements, we will show it as a function call. That is,
[acc: exp for x in it]is equivalent to
accumulation(acc, lambda x: exp, it)where we have defined
def accumulation(acc, f, it): a = acc() for x in it: a.add(f(x)) return a.result()We can now define Mean as
class Mean: def __init__(self): self.total, self.n = 0, 0 def add(self, exp): self.total, self.n = self.total+exp, self.n+1 def result(self): return self.total / self.n
[Argmax: votes(c) for c in candidates](In math "argmax" (or "arg max") gives you the element of a set with the maximal value on some function (whereas "max" gives you the maximal function value itself).) Here we are asking for the element c with the maximal value of votes(c). We can program this by adding the three characters ", x" to the definition of accumulation:
def accumulation(acc, f, it): a = acc() for x in it: a.add(f(x), x) return a.result()
[Some: temp(hour) > 30 for hour in range(24)]We want this to return a true value if there is some hour with temp(hour) > 30, and False if there is not. Obviously it can stop iterating when it finds the first element that satisfies the condition. We will handle this by having the accumulator's add method return a true value if it wants to stop the loop. After changing accumulation
def accumulation(acc, f, it): a = acc() for x in it: if a.add(f(x), x): break return a.result()we can define Some as:
class Some: def __init__(self): self.value = False def add(self, exp, _): if exp: self.value = True return StopIteration def result(self): return self.valueNote I returned StopIteration (rather than, say, True) just as a hint to the reader. (I considered having the protocol raise StopIteration rather than just returning it, but I think it is simpler just to return a value.) Note that we still call the result method, even when we have broken out of the loop. If we had the add method return the final result for the accumulation it would work and be slightly simpler for Some, but wouldn't work on an accumulator that needs to exit early and then return None as its final result.
[Top(10): humor(joke) for joke in jokes]Up to now, accumulators were passed no arguments when they were initialized, implicitly, by the function accumulation. Here we have an accumulator, Top, that takes a parameter, and all we have to do is change accumulation so that it can accept either an accumulator class or an instance:
def accumulation(acc, f, it): if isinstance(acc, (types.ClassType, types.TypeType)): a = acc() else: a = acc for x in it: if a.add(f(x), x): break return a.result()Now we can define Top:
from heapq import * class Top: def __init__(self, k): self.k = k self.heap = [] def add(self, exp, _): heap = self.heap if len(heap) < self.k: heappush(heap, exp) elif exp < heap.biggest(): heapreplace(heap, exp) def result(self): return self.heapNote that this can be much more efficient than alternatives such as:
jokes.sort(lambda a, b: humor(a) < humor(b)) jokes[0:10]
a = Top(10) for j in new_jokes: print j a.add(float(raw_input("Rating?")), j) return [a: humor(j) for j in machine_ratable_jokes]Here we need a print statement, so we explicitly manage the accumulator a, filling it with results derived by asking the user for a rating, and then we use the same accumulator in a display to add in some more results and return the final list of top 10 results.
Another way to look at it is that there are some common algorithms in programming, such as finding the best element(s) or value(s) from a candidate set, that appear again and again, and this allows us to abstract out these algorithms, rather than re-implement them every time.
One drawback of my proposal is that the [ ... ] syntax suggests that a list is returned, when that need not be the case. My defense is that it is iterating over a list (or at least, an iterable). But Guido pronounced, quite reasonably, that this was not a sufficient defense.
There have been past discussions of set displays and dict displays, using syntax something like:
words = { w.lower() for w in text.split() } squares = { i:i*i for i in range(N) }I'm not going to take a stand on whether these are good ideas, but I note you can do this with accumulation displays easily:
words = [Set: w.lower() for w in text.split()] squares = [Dict: i, i*i for i in range(N)](or of course you can do it without any changes to the language):
words = set([w.lower() for w in text.split()]) squares = dict([i, i*i for i in range(N)])Discussion of my proposal led to the ressurection of PEP 289 for generator comprehensions, slightly updated and now called generator expressions. The idea is that
exp for x in itwould be roughly equivalent to
def gen(): for x in it: yield exp return gen()You get short-circuit evaluation just by ceasing to call the generator. In fact, all the accumulators that I wrote could be re-written as functions that take an generator/iterator/sequence as input (although if you want both the loop variable and the computed value, you need to pass them both as a tuple). Compare:
Accumulation Display | Generator Expression plus Function |
---|---|
[Sum: x*x for x in numbers] [Product: Prob_spam(word) for word in email_msg] [Min: temp(hour) for hour in range(24)] [Median: f(x) for x in data] [SortBy: abs(x) for x in (-2, -4, 3, 1)] [Top(10): humor(joke) for joke in jokes] [Argmax: votes(c) for c in candidates] |
sum(x*x for x in numbers) product(Prob_spam(word) for word in email_msg) min(temp(hour) for hour in range(24)) median(f(x) for x in data) sortdecorated((abs(x),x) for x in (-2, -4, 3, 1)) top(10, (humor(joke) for joke in jokes)) argmax((c,votes(c)) for c in candidates) |
With the accumulation display approach, you need a change to the language, and a new set of accumulators. With the generator expression approach you need a change to the language of similar complexity, and a combination of old funcxtions (sum, min) and new functions (product, top, argmax). The new functions are somewhat easier to write than the accumulators; for example a 6-line some function instead of an 8-line Some accumulator:
def some(items): "If some element in items is true, return it; otherwise return False" for item in items: if item: return item return False
Philip J. Eby and Raymond Hettinger proposed the generator expression idea. I'm happy that my proposal came at the right time to help spur a discussion that appears to be on the way to a happy ending--I think the generator expression idea is more generally useful than my proposal.
Other points were raised in the discussion on python-dev. Ian Bicking made a suggestion that I interpreted to mean that accumulation should be implemented as something like:
def accumulation(acc, f, it): a = acc.__accum__() for x in it: if a.add(f(x), x): break return a.result()The advantage is that min, max, sum and perhaps some others could be given a __accum__ method. I'm not sure of the details of how this would work out.
temp = [70, 70, 71, 74, 76, 76, 72, 76, 77, 77, 77, 78, 78, 79, 79, 79, 78, 80, 82, 83, 83, 81, 84, 83] data = temp votes = {'Gray': 45, 'Arnie': 48, 'Peter': 3, 'Cruz': 32, 'Tom': 13} candidates = ['Gray', 'Arnie', 'Peter', 'Cruz', 'Tom'] [Max: temp[hour] for hour in range(24)] ==> 84 [Min: temp[hour] for hour in range(24)] ==> 70 [Sum: x*x for x in data] ==> 144999 [Mean: f(x) for x in data] ==> 155.25 [Median: f(x) for x in data] ==> 156.0 [Mode: f(x) for x in data] ==> 166 [Argmax: votes[c] for c in candidates] ==> Arnie [Argmin: votes[c] for c in candidates] ==> Peter [Some: temp[hour] > 75 for hour in range(24)] ==> True [Every: temp[hour] > 75 for hour in range(24)] ==> False [Top(10): temp[hour] for hour in range(24)] ==> [84, 83, 83, 83, 82, 81, 80, 79, 79, 79] [Join(', '): votes[c] for c in candidates] ==> 45, 48, 3, 32, 13 [SortBy: abs(x) for x in (-2,-4, 3, 1)] ==> [1, -2, 3, -4] [SortBy(reverse=True): abs(x) for x in (-2, -4, 3, 1)] ==> [-4, 3, -2, 1]
listmaker ::= expression ( list_for | ( "," expression )* [","] )we could change it to:
listmaker ::= expression [ ":" expression ] ( list_for | ( "," expression )* [","] )
[acc: exp for x in it1 if c1 for y in it2 if c2 ...]to mean
from itertools import ifilter accumulation(acc, lambda (x, y, ...): exp, icross_product(ifilter(c1, it1), ifilter(c2, it2), ...))where icross_product takes the cross product of all the argument iterables, in left-to-right order. That is, it generates tuples in the order
(c1[0], c2[0], ... cn[0]) (c1[0], c2[0], ... cn[1]) (c1[0], c2[0], ... cn[2]) ... (c1[0], c2[0], ... cn[-1]) (c1[0], c2[0], ... cn_1[1], cn[0]) (c1[0], c2[0], ... cn_1[1], cn[1]) ... (c1[-1], c2[-1], ... cn[-1])Perhaps it should be a standard part of itertools.
This document is in the public domain and may be used for any purpose.