Chaining a sequence of generators

I often gravitate towards solutions using a series of chained generators, in the style of David Beazley’s ‘Generator Tricks for Systems Programmers.’

This results in the outer level of my code calling one generator after another, terminating in something that consumes the rows, pulling data one row at a time through each of the generators:

inputRows = read()
parsedRows = parse(inputRows)
processedRows = process(parsedRows)
outputRows = format_(processedRows)

where each called function except the last is actually a generator, e.g:

def parse(rows):
    for row in rows:
        yield int(row)

This is great. But my itch is that the top level code above is a bit wordy, given that what it does is so simple. The reader has to check each temporary variable quite carefully to be sure it’s doing the right thing.

Fowler’s ‘Refactoring’ describes circumstances when it’s good to remove intermediate variables, which results in:

output( format_( process( parse( read() ) ) ) )

This is certainly less wordy, and expresses what’s happening very directly, but it annoys some of my colleagues that the called functions are listed in reverse order from what one might intuitively expect.

I’ve had this idea in my head to create a decorator for generators which allows one to chain them in an intuitive order, possibly using some unconventional notation such as:

read() | parse | process | format_ | output

where ‘parse’, et al, are now decorated with ‘@chainable’ or somesuch, which returns an instance of a class that stores the wrapped generator, and overrides __or__ to do its magic. Maybe ‘read’ doesn’t need to be invoked manually there at the start of the chain. I haven’t really thought this through.
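For the record, here's a minimal sketch of what that abandoned decorator idea might have looked like. The names (`Chainable`, `chainable`, `double`, `list_all`) are hypothetical, and it uses `__ror__` rather than `__or__`, since the left-hand operand at each step of the chain is a plain iterable that knows nothing about our class:

```python
class Chainable:
    """Wraps a generator function so pipelines read left-to-right via '|'."""
    def __init__(self, func):
        self.func = func

    def __ror__(self, upstream):
        # 'upstream | self' feeds the upstream iterable into the wrapped function.
        return self.func(upstream)

def chainable(func):
    return Chainable(func)

@chainable
def double(rows):
    for row in rows:
        yield row * 2

@chainable
def list_all(rows):
    return list(rows)

result = [1, 2, 3] | double | list_all
# result == [2, 4, 6]
```

Each `|` hands the iterable on its left into the wrapped generator on its right, so the chain evaluates left-to-right as `(([1, 2, 3] | double) | list_all)`.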

Luckily, before embarking on that, I realised today I’ve been over-complicating the whole thing. There’s no need for decorators, nor for the cute ‘|’ syntax. I just need a plain old function:

def link(source, *transforms):
    args = source
    for transform in transforms:
        args = transform(args)
    return args

Update: This code has been improved thanks to suggestions in the comments from Daniel Pope (eliminate the ‘first’ variable) and Xtian (take an iterable rather than a callable for the source.)

This assumes the first item passed to link is an iterable, and each subsequent item is a generator that takes the result of the item before.

If the final item in the sequence passed to ‘link’ is a generator, then this returns a generator which is the composite of all the ones passed in:

for item in link(read(), parse, process, format_):
    print(item)
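To make that concrete, here's a self-contained run of `link` with toy stages standing in for `read`, `parse`, `process` and `format_` (the bodies are invented for illustration; only `parse` matches the snippet earlier in the post):

```python
def link(source, *transforms):
    # Feed 'source' through each transform in turn.
    args = source
    for transform in transforms:
        args = transform(args)
    return args

def read():
    yield from ["1", "2", "3"]

def parse(rows):
    for row in rows:
        yield int(row)

def process(rows):
    for row in rows:
        yield row * 10

def format_(rows):
    for row in rows:
        yield "row=%d" % row

result = list(link(read(), parse, process, format_))
# result == ["row=10", "row=20", "row=30"]
```

Nothing is pulled through the pipeline until the final `list` call consumes it; `link` itself just wires the generators together.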

Or if the final item passed to ‘link’ is a regular function, which consumes the preceding generators, then calling ‘link’ will invoke the generators, i.e. the following is the same as the above ‘for’ loop:

link(read(), parse, process, format_, output)

There are some rough edges, such as deciding what to do when some of the generators require extra args. Presumably ‘partial’ could help here. But in general, ‘link’ only needs to be written once, and I’m liking it.
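For example, `functools.partial` can pre-bind the extra arguments so that a multi-argument generator still fits the one-argument shape `link` expects. The `scale` generator here is invented for illustration:

```python
from functools import partial

def link(source, *transforms):
    args = source
    for transform in transforms:
        args = transform(args)
    return args

def scale(rows, factor):
    # Needs a second argument, so can't be passed to 'link' directly.
    for row in rows:
        yield row * factor

# partial binds 'factor', leaving a one-argument callable for the pipeline.
result = list(link(iter([1, 2, 3]), partial(scale, factor=10)))
# result == [10, 20, 30]
```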

8 thoughts on “Chaining a sequence of generators”

  1. Agreed all over. It initially bugged me to break uniformity like that, but on reflection that’s totally irrational of me. What you suggest is way more general and more useful. Post updated!

  2. (I’m glad you changed it from the original “annoys some of my curmudgeonly colleagues”.)

    I like it! Although, I think I’d like it more if it took an iterator as its first argument, rather than a callable that returns an iterator. Then you could relax the restriction that the source callable takes no arguments. It makes it slightly less uniform, but I think the source of a pipeline is always special.

    Plus that would mean you could pass existing but not yet processed generators into it.

  3. Ha! Brilliant! I *knew* someone must have implemented that already, but couldn’t find it. Very interesting to see, thanks for that.

  4. Yeah, it’s a reasonable point, but one I hear debated back and forth. Some people don’t like this style because it breaks their general rule to not reuse existing variables for new things. I generally feel like that’s a good rule, but could be persuaded to break it in this case. I definitely agree that your example is better than the original snippet from my post. But I *still* think it’s too wordy, with too many needless temporaries. :-)

  5. I would reuse the same temporary variable:

    rows = read()
    rows = parse(rows)

    No careful checking necessary.

  6. Using a control boolean like ‘first’ always looks like a smell. There’s always a better way. In this case you can just write the function as

    def link(source, *filters):
        g = source()
        for f in filters:
            g = f(g)
        return g
