Splitting Iterator into Iterator and Iterator

Question

I would like to split the output of an object that implements Iterator<(A,B)> into two objects that implement Iterator<A> and Iterator<B>. Since one of the outputs could be iterated more than the other, I'll need to buffer up the output of the Iterator<(A,B)> (because I can't rely on the Iterator<(A,B)> being cloneable.) The problem is that the iterator could be infinite, so I can't simply collect the output of the iterator into two buffers and return iterators over the two buffers.

So it seems that I'll need to hold buffers of the A and B objects, and whenever one of the buffers is empty I'll fill it with samples from the Iterator<(A,B)> object. This means that I'll need two iterable structs that have mutable references to the input iterator (since both of them will need to call next() on the input to fill up the buffers), which is impossible.

So, is there any way to accomplish this in a safe way?

huon · Accepted Answer

This is possible. As you identified you need mutable references to the base iterator from both handles, which is possible using a type with "internal mutability", that is, one that uses unsafe code internally to expose a safe API for acquiring a &mut to aliasable data (i.e. contained in a &) by dynamically enforcing the invariants that the compiler normally enforces at compile time outside unsafe.

I'm assuming you're happy to keep the two iterators on a single thread¹, so, in this case, we want a RefCell. We also need to be able to have access to the RefCell from the two handles, entailing storing either a &RefCell<...> or an Rc<RefCell<...>>. The former would be too restrictive, as it would only allow us to use the pair of iterators in and below the stack frame in which the RefCell is created, while we want to be able to freely pass the iterators around, so Rc it is.

In summary, we're basically going to be storing an Rc<RefCell<Iterator<(A,B)>>>, there's just the question of buffering. The right tool for the job here is a RingBuf since we want efficient push/pop at the front and back. Thus, the thing we're sharing (i.e. inside the RefCell) could look like:

struct SharedInner<A, B, It> {
    iter: It,
    first: RingBuf<A>,
    second: RingBuf<B>,
}

We can abbreviate the type actually being shared as type Shared<A, B, It> = Rc<RefCell<SharedInner<A, B, It>>>;, which allows us to define the iterators:

struct First<A, B, It> {
    data: Shared<A, B, It>
}
impl Iterator<A> for First<A,B,It> {
    fn next(&mut self) -> Option<A> {
        // ...
    }
}

To implement next the first thing to do is get a &mut to the SharedInner, via self.data.borrow_mut();. And then get an element out of it: check the right buffer, or otherwise get a new element from iter (remembering to buffer the left-over B):

let mut inner = self.data.borrow_mut();

inner.first.pop_front().or_else(|| {
    inner.iter.next().map(|(a,b)| {
        inner.second.push(b);
        a
    })
})

Docs: RingBuf.pop_front, Option.or_else.

The iterator for the other side is similar. In total:

use std::cell::RefCell;
use std::collections::{Deque, RingBuf};
use std::rc::Rc;

struct SharedInner<A, B, It> {
    iter: It,
    first: RingBuf<A>,
    second: RingBuf<B>
}

type Shared<A, B, It> = Rc<RefCell<SharedInner<A, B, It>>>;

struct First<A, B, It> {
    data: Shared<A, B, It>
}

impl<A,B, It: Iterator<(A,B)>> Iterator<A> for First<A, B, It> {
    fn next(&mut self) -> Option<A> {
        let mut inner = self.data.borrow_mut();

        // try to get one from the stored data
        inner.first.pop_front().or_else(|| 
            // nothing stored, we need a new element.
            inner.iter.next().map(|(a, b)| {
                inner.second.push(b);
                a
            }))
    }
}

struct Second<A, B, It> {
    data: Shared<A, B, It>
}
impl<A,B, It: Iterator<(A,B)>> Iterator<B> for Second<A,B,It> {
    fn next(&mut self) -> Option<B> {
        let mut inner = self.data.borrow_mut();

        inner.second.pop_front().or_else(|| {
            inner.iter.next().map(|(a, b)| {
                inner.first.push(a);
                b
            })
        })
    }
}

fn split<A, B, It: Iterator<(A,B)>>(it: It) -> (First<A, B, It>, 
                                                Second<A, B, It>) {
    let data = Rc::new(RefCell::new(SharedInner { 
        iter: it,
        first: RingBuf::new(),
        second: RingBuf::new(),
    }));

    (First { data: data.clone() }, Second { data: data })
}

fn main() {
    let pairs = range(1u32, 10 + 1).map(|x| (x, 1.0 / x as f64));

    let (mut first, mut second) = split(pairs);

    println!("first:");
    for x in first.by_ref().take(3) {
        println!("  {}", x);
    }

    println!("second:");
    for y in second.by_ref().take(5) {
        if y < 0.2 { break }
        println!("  {}", y);
    }

    let a = first.collect::<Vec<u32>>();
    let b = second.collect::<Vec<f64>>();

    println!("a {}
b {}", a, b);
}

which prints

first:
  1
  2
  3
second:
  1
  0.5
  0.333333
  0.25
  0.2
a [4, 5, 6, 7, 8, 9, 10]
b [0.166667, 0.142857, 0.125, 0.111111, 0.1]

playpen.

There's various ways this could be optimised, e.g. when fetching in First, only buffer the left-over B if a Second handle exists.

¹ If you were looking to run them in separate threads just replace the RefCell with a Mutex and the Rc with an Arc, and add the necessary bounds.

Splitting Iterator<(A,B)> into Iterator<A> and Iterator<B>

Tags:

rust

awelkie

1 Answers

huon

Recent Activity

Donate For Us