Monthly Archives: June 2018

Generic Top N Elements In Scala

Getting the top N elements from a list of elements is a common need in applications that involve data retrieval. If the list is big, it’ll be inefficient and wasteful (in terms of processing resource) to sort the entire list when one is interested in only the top few elements.

Consider a list of real numbers (i.e. Double or Float-typed) and let’s say we want to fetch the smallest N numbers from the list. A commonly used algorithm for the task is rather straight forward:

Start with the first N numbers of the list as the selected N-element sublist, then check for each of the remaining numbers of the list and if it’s smaller than the largest number in the sublist swap out in each iteration the largest number with it.

Algorithmic steps

Formulating that as programmatic steps, we have:

  1. Maintain a sorted N-element list in descending order, hence its head is the max in the list (Assuming N isn’t a big number, the cost of sorting is trivial).
  2. In each iteration, if the current element in the original list is smaller than the head element in the N-element list, replace the head element with the current element; otherwise leave the current N-element list unchanged.

Upon completing the iterations, the N-element list will consist of the smallest elements in the original list and a final sorting in ascending order will result in a sorted N-element list:

// Top N (smallest) elements from a Double-type list

def topN(n: Int, list: List[Double]): List[Double] = {
  def maxListHead(l: List[Double], e: Double): List[Double] = {
    if (e < l.head) (e :: l.tail).sortWith(_ > _) else l
  }
  list.drop(n).foldLeft( list.take(n).sortWith(_ > _) )( maxListHead ).
    sortWith(_ < _)
}

val list: List[Double] = List(3, 2, 8, 2, 9, 1, 5, 5, 9, 1, 7, 3, 4)

topN(5, list)
// res1: List[Double] = List(1.0, 1.0, 2.0, 2.0, 3.0)

Note that it would be trivial to modify the method to fetch the largest N numbers (instead of the smallest) in which case one only needs to reverse the inequality operator in ‘e < l.head’, the iterative ‘sorthWith(_ > _)’ and the final ‘sorthWith(_ < _)’.

Refactoring to eliminate sorting

Now, let’s say we’re going to use it to fetch some top N elements where N is a little bigger, like top 5,000 from a 1 million element list. Except for the inevitable final sorting of the sublist, all the other ‘sorthWith()’ operations can be replaced with something less expensive. Since all we care is to be able to conditionally swap out the largest number in the sublist, we just need the largest number to be placed at the head of the sublist and the same algorithmic flow will work fine.

The refactored topN method below replaces all ‘sortWith()’ (except for the final sorting) with ‘bigHead()’ which places the largest number of the input list at its head position:

// Top N (smallest) elements from a Double-type list (refactored)

def topN(n: Int, list: List[Double]): List[Double] = {
  def bigHead(l: List[Double]): List[Double] = list match {
    case Nil => list
    case _ =>
      l.tail.foldLeft( List(l.head) )( (acc, x) =>
        if (x >= acc.head) x :: acc else acc.head :: x :: acc.tail
    )
  }
  def maxListHead(l: List[Double], e: Double): List[Double] = {
    if (e < l.head) bigHead(e :: l.tail) else l
  }
  list.drop(n).foldLeft( bigHead(list.take(n)) )( maxListHead ).
    sortWith(_ < _)
}

Generic top N numeric elements

Next, we generalize method topN to handle any list of elements of the type implicitly associated with Numeric, with a typeclass pattern known as context bound.

// Top N (smallest) elements from a generic numeric list

def topN[T](n: Int, list: List[T])(implicit num: Numeric[T]): List[T] = {
  import num.{mkNumericOps, mkOrderingOps}
  def bigHead(l: List[T]): List[T] = list match {
    case Nil => list
    case _ =>       l.tail.foldLeft( List(l.head) )( (acc, x) =>
        if (x >= acc.head) x :: acc else acc.head :: x :: acc.tail
      )
  }
  def maxListHead(l: List[T], e: T): List[T] = {
    if (e < l.head) bigHead(e :: l.tail) else l
  }
  list.drop(n).foldLeft( bigHead(list.take(n)) )( maxListHead ).
    sortWith(_ < _)
}

val list: List[Double] = List(3, 2, 8, 2, 9, 1, 5, 5, 9, 1, 7, 3, 4)
topN[Double](5, list)
// res1: List[Double] = List(1.0, 1.0, 2.0, 2.0, 3.0)

val list: List[Int] = List(3, 2, 8, 2, 9, 1, 5, 5, 9, 1, 7, 3, 4)
topN[Int](5, list)
// res2: List[Int] = List(1, 1, 2, 2, 3)

With the context bound for type ‘T’, importing the implicit mkNumericOps and mkOrderingOps methods makes arithmetic and comparison operators available for the list elements to be compared and ordered.

Generic top N objects ordered by mapped values

To further generalize topN, rather than being limited to numeric elements we enable it to take a list of generic objects and return the top N elements ordered by the individual element’s corresponding value (e.g. an order-able class field of the object). To accomplish that, we revise topN as follows:

  • Loosen the context bound from ‘Numeric’ to the more generic Ordering so that items can be ordered by non-numeric values such as strings
  • Take as an additional parameter a mapping function that tells what values corresponding to the objects are to be ordered by
// Top N elements from a list of objects ordered by object-mapped values

case class Score[T : Numeric](id: String, rank: T)

def topN[S, T : Ordering](n: Int, list: List[S], f: S => T): List[S] = {
  val orderer = implicitly[Ordering[T]]
  import orderer._
  def bigHead(l: List[S]): List[S] = list match {
    case Nil => list
    case _ =>
      l.tail.foldLeft( List(l.head) )( (acc, x) =>
        if (f(x) >= f(acc.head)) x :: acc else acc.head :: x :: acc.tail
      )
  }
  def maxListHead(l: List[S], e: S): List[S] = {
    if (f(e) < f(l.head)) bigHead((e :: l.tail)) else l
  }
  list.drop(n).foldLeft( bigHead(list.take(n)) )( maxListHead ).
    sortWith(f(_) < f(_))
}

val scores = List(
  Score("a", 3), Score("b", 6), Score("c", 8), Score("d", 1), Score("e", 11),
  Score("f", 5), Score("g", 9), Score("h", 12), Score("i", 2), Score("j", 10)
)
topN(5, scores, (s: Score[Int]) => s.rank)
// res1: List[Score[Int]] = List(Score(d,1), Score(i,2), Score(a,3), Score(f,5), Score(b,6))

Note that the type parameter ‘T : Ordering’, which signifies a context bound, is the shorthand notation for:

// Context Bound
case class Score[T](id: String, rank: T)(implicit orderer: Ordering[T])