The magic of for-yield with Scala collections

Scala has a rich and well-designed collections library. You can find a good overview of the collections library on the Scala documentation website.

In this post I’ll show how using for with yield on collections works and how the collections library is designed and implemented so that it does exactly what you expect it to do.

What for-yield does

You can ofcourse use a for loop to iterate over a collection. If what you want to do is do something with each element of a collection and return a new collection with the transformed elements, you can use yield. Have a look at the following examples.

scala> val xs = List(1, 2, 3, 4)
xs: List[Int] = List(1, 2, 3, 4)

scala> for (x <- xs) yield x * 2
res0: List[Int] = List(2, 4, 6, 8)

scala> val ys = Vector(1, 2, 3, 4)
ys: scala.collection.immutable.Vector[Int] = Vector(1, 2, 3, 4)

scala> for (y <- ys) yield y * 2
res1: scala.collection.immutable.Vector[Int] = Vector(2, 4, 6, 8)

scala> val zs = Set(1, 2, 3, 4)
zs: scala.collection.immutable.Set[Int] = Set(1, 2, 3, 4)

scala> for (z <- zs) yield z * 2
res2: scala.collection.immutable.Set[Int] = Set(2, 4, 6, 8)

The interesting thing to see here is this: The type of the result of each of the for loops is the same as the type of the original collection that the loop is iterating over. So, the loop over xs produces a List, the loop over ys produces a Vector and the loop over zs produces a Set.

Let's look at it in a bit more detail. Two things are happening: 1) the static type is preserved: the type of the result of the for is the same as the type of the variable that holds the collection that we're iterating over, and 2) the dynamic type is preserved: the type of the actual collection object is the same as the type of the collection object that we're iterating over.

scala> val xs: Seq[Int] = List(1, 2, 3, 4)
xs: Seq[Int] = List(1, 2, 3, 4)

scala> for (x <- xs) yield x * 2
res0: Seq[Int] = List(2, 4, 6, 8)

Here we have declared the type of xs to be Seq[Int], with List[Int] as the concrete implementation to use. The type of the result of the loop is now also Seq[Int], with List[Int] as the implementation. Would we declare the type of xs to be LinearSeq[Int], while still using List[Int] as the implementation, then the type of the result would be LinearSeq[Int]. And, conversely, would we use a different implementation of Seq, for example Vector, then the type of the actual object would be Vector as well.

scala> import scala.collection.immutable._
import scala.collection.immutable._

scala> val xs: LinearSeq[Int] = List(1, 2, 3, 4)
xs: scala.collection.immutable.LinearSeq[Int] = List(1, 2, 3, 4)

scala> for (x <- xs) yield x * 2
res0: scala.collection.immutable.LinearSeq[Int] = List(2, 4, 6, 8)

scala> val ys: Seq[Int] = Vector(1, 2, 3, 4)
ys: scala.collection.immutable.Seq[Int] = Vector(1, 2, 3, 4)

scala> for (y <- ys) yield y * 2
res1: scala.collection.immutable.Seq[Int] = Vector(2, 4, 6, 8)

The type of the result of a for loop with yield is not always exactly the same as what you are iterating over. For example, if you iterate over a Range, you get back an IndexedSeq with Vector as the concrete type:

scala> val xs = 1 to 4
xs: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, 4)

scala> for (x <- xs) yield x * 2
res0: scala.collection.immutable.IndexedSeq[Int] = Vector(2, 4, 6, 8)

The type of the result can even depend on the type of the function that is performed on the elements. In the following example, the first for returns another Map[Int, String] but the second for returns a List[Int].

scala> val map = Map(1 -> "one", 2 -> "two", 3 -> "three")
map: scala.collection.immutable.Map[Int,String] = Map(1 -> one, 2 -> two, 3 -> three)

scala> for ((k, v) <- map) yield (k, v.toUpperCase)
res0: scala.collection.immutable.Map[Int,String] = Map(1 -> ONE, 2 -> TWO, 3 -> THREE)

scala> for ((k, v) <- map) yield k
res1: scala.collection.immutable.Iterable[Int] = List(1, 2, 3)

Below, we'll dive deep into this example to find out exactly how this happens. Before we do that, let's look at how for with yield works.

How for-yield works

Scala's for loop is really just syntactic sugar. The compiler transforms a for into a combination of calls to four methods on the thing that it's iterating over: map, withFilter, flatMap, foreach. Section 6.19 of the Scala Language Specification explains exactly how this transformation is done. A simple for with yield as we have used in the examples above is transformed to a single call to map.

scala> val xs = Seq(1, 2, 3, 4)
xs: Seq[Int] = List(1, 2, 3, 4)

scala> for (x <- xs) yield x * 2 // This is exactly the same as...
res0: Seq[Int] = List(2, 4, 6, 8)

scala> xs.map(x => x * 2) // ...this
res1: Seq[Int] = List(2, 4, 6, 8)

If that's how it works, then the mystery is already solved, and at first sight there's not much to it. It just means that each collection type has an implementation of map that returns the expected type.

For example, List[A] should have a method def map[B](f: (A) => B): List[B], and Vector[A] should have a method def map[B](f: (A) => B): Vector[B].

It also explains why the type of the original variable is preserved, if for example trait Seq[A] has a method declaration def map[B](f: (A) => B): Seq[B] - because of covariant return types the type of the result would be Seq if you'd call it on a variable of type Seq with any concrete implementation of Seq.

But the collections library is implemented in a way that's more sophisticated than simply having the appropriate map method in each type of collection. It's interesting to have a look how the collections library is designed and implemented in a way that avoids having to implement map and many other methods in a way that avoids code duplication.

The design of Scala collections

I'm not going to describe everything there is to the design of the Scala collections library here. I want to highlight one particular aspect that explains how we get the right kind of collection as shown above.

There are many different collection traits and interfaces in Scala's collection library, and there's a series of methods that all of the collections implement. For example, all collections have methods foreach, map, flatMap, filter, head, tail, take, drop and many others.

The implementers of the collections library could simply have implemented all these methods in all the different collection classes, but that would mean there would be a lot of copy-and-paste code, because most of those methods do almost the same in the different classes. What they did instead was implement these methods just once, in trait TraversableLike, which all the other collection classes extend.

The common methods in trait TraversableLike are implemented in terms of two methods: foreach and newBuilder. This means that any collection that should have all of the common methods just has to extend TraversableLike and provide implementations for foreach and newBuilder, and then it will get all the other methods for free. This is an example of the template method pattern.

The foreach method simply has to call a function on each element of the collection. The newBuilder method returns a Builder for creating a new collection of the same type and with the same element type.

But just these two methods are not enough to explain how you can get back a Vector when you iterate over a Range, or how the type of the result when iterating over a Map can be different depending on just the type of the transformation function as we saw in the last example of the first section above.

To see how this is achieved, let's first look at the signature of the method map in trait TraversableLike.

trait TraversableLike[+A, +Repr] extends ... {
  // ...
  def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That
  // ...
}

This is what the type arguments mean:

  • A is the type of the elements of the collection.
  • Repr ("representation") is the type of the implementation of the collection.
  • B is the type of the elements of the result of applying map.
  • That is the type of the collection of the result of applying map.

So you call map on a collection of type Repr with elements of type A, you pass it a function f that takes an A an returns a B, and map then returns a That containing elements of type B.

What's interesting here is the implicit CanBuildFrom parameter. This is where the flexibility comes from that allows collections to return different types of results. The CanBuildFrom trait looks like this:

trait CanBuildFrom[-From, -Elem, +To] {
  def apply(from: From): Builder[Elem, To]
  def apply(): Builder[Elem, To]
}

It declares two apply methods, of which especially the first one is interesting. It takes a From and returns a Builder to build a To with elements of type Elem. In other words, a CanBuildFrom is a factory for Builder objects that, from one kind of collection, can build another kind of collection.

Collection implementations can implement CanBuildFrom objects with specific type arguments for producing the desired type of result and make sure they are in scope, so that Scala's implicit resolution mechanism finds them and passes them when map is called on the collection. The map method implemented in TraversableLike will then call apply on the CanBuildFrom, which will return a Builder to build the desired type of result.

It works in a similar way for other methods that return another collection, such as filter.

A deeper look

Let's look at the last example of the first section again and see what exactly happens under the hood.

scala> val map = Map(1 -> "one", 2 -> "two", 3 -> "three")
map: scala.collection.immutable.Map[Int,String] = Map(1 -> one, 2 -> two, 3 -> three)

scala> for ((k, v) <- map) yield (k, v.toUpperCase)
res0: scala.collection.immutable.Map[Int,String] = Map(1 -> ONE, 2 -> TWO, 3 -> THREE)

scala> for ((k, v) <- map) yield k
res1: scala.collection.immutable.Iterable[Int] = List(1, 2, 3)

The first example results in a Map[Int, String] and the second one in a List[Int]. How does that happen?

The for with yield loops are transformed into calls to map.

The method map is implemented in trait TraversableLike. It calls apply on the implicit CanBuildFrom argument, which creates a Builder to build the result. Then it calls foreach to loop over all key-value pairs of the map. In each iteration, it applies the function f (whatever comes after the yield in the example) to the current element, and passes the result to the Builder. Finally it returns the result of the builder.

The difference in the result type of the two examples above happens because a different CanBuildFrom is resolved by the implicit resolution mechanism.

The first example

In the first example, the CanBuildFrom is found in the companion object of Map, which contains the following implicit method:

object Map extends ImmutableMapFactory[Map] {
  // ...
  implicit def canBuildFrom[X, Y]: CanBuildFrom[Coll, (X, Y), Map[X, Y]] =
    new MapCanBuildFrom[X, Y]
  // ...
}

(Note that I've renamed the type parameters from the actual source code, to make it less confusing later on).

By looking at the type arguments of the CanBuildFrom that this method returns, we can see why it applies here. Remember that CanBuildFrom takes three type parameters: From, Elem and To.

  • From is Coll, which is a type alias for Map[_, _] (defined in GenMapFactory).
  • Elem is (X, Y), a tuple of a key and value of the resulting map.
  • To is Map[X, Y] - the result will be a Map with keys of type X and values of type Y.

Now, look back at the declaration of the method map in TraversableLike.

def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That

Note that A is a type parameter of trait TraversableLike and refers to the type of the elements of the collection. A Map is a TraversableLike where the elements are the key-value pairs of the map. If the types of the keys and values in the Map are K and V respectively, then A for a Map means (K, V) (a tuple of K and V).

Replacing A and B in the declaration of the map method with the type arguments of trait Map and of the CanBuildFrom that the implicit method canBuildFrom returns, we get the following.

def map[...](f: (K, V) => (X, Y))
      (implicit bf: CanBuildFrom[Map[K, V], (X, Y), Map[X, Y]]): Map[X, Y]

Note that in the example, the function f takes a tuple (Int, String) and also returns a tuple (Int, String). So f matches what map requires where K is Int, V is String, X is Int and Y is String.

As you see the CanBuildFrom returned by the implicit method canBuildFrom in the Map companion object is specifically meant to transform a Map into another Map, which is exactly what we did in the first example.

The second example

In the second example, the function that we pass to map does not return a tuple, but an Int. The CanBuildFrom that was used by the first example isn't applicable here, so the Scala compiler has to look further to find another CanBuildFrom to use.

The Map companion object doesn't have another specific CanBuildFrom that is applicable in this case, so the Scala compiler is going to look in the supertypes of Map. The type hierarchy of Map is quite complicated, so by looking at the source code it's hard to see where the CanBuildFrom comes from for this case. It comes from the companion object of trait Iterable, which has an implicit canBuildFrom method similar to the one in the companion object of Map. The fact that this comes from Iterable explains why the static type of the result of the second example is Iterable.

There are a few more steps to see the why the actual object is a List.

The canBuildFrom in Iterable returns an instance of class GenericCanBuildFrom which is defined in GenTraversableFactory. GenericCanBuildFrom delegates creating the Builder to the method genericBuilder of the original collection.

Map doesn't have its own implementation of genericBuilder. It inherits this method from GenericTraversableTemplate, which indirectly calls newBuilder in the companion object of Iterable. That method returns a ListBuffer, which is a Builder for List, which explains why the result is a List.

Conclusion

The design and implementation of the Scala collections library is interesting to study. It contains interesting mechanisms and you can learn a lot by studying the source code.

Implicits are a powerful feature, but they can also make code hard to understand, because it's often hard to see which implicit value applies to a specific case - especially in the collections library, which consists of dozens of traits and classes which are all interconnected.

References

You may also like...