
String Pattern Matching Implementation

Regardless of the programming platform one is on, string pattern matching is often needed in various use cases, particularly as a filtering condition for data retrieval. Regex, short for regular expression, has been one of the most common and versatile string pattern matching methods. Then, there is also the simpler wildcard pattern style.

In this blog post, we’re going to implement in Scala the wildcard pattern matching as well as a minimal version of regex pattern matching. On top of that, we’ll also come up with their dynamic programming versions for better performance.

Recursion will be the first approach to implementing the string pattern matching methods. Why? That’s because algorithmically it’s probably the crispest way to reason about the implementation logic. There is also the benefit that in many cases the more performant dynamic programming variants can be formulated based on the recursion logic.

Wildcard pattern matching via recursion

Wildcard pattern matching consists of two special characters ? and *, with ? matching any single character and * matching zero or more characters.

For example, wildcard pattern “a*b?c” would match “axybzc” and “abbc”, but not “axyzbc” or “aczc”.

// Wildcard pattern matching implementation via recursion in Scala
def wildcardMatch(s: String, p: String): Boolean = {
  if (s.isEmpty)
    p.isEmpty || p.forall(_ == '*')
  else if (p.isEmpty)
    false
  else if (p(0) == '*')
    wildcardMatch(s, p.substring(1)) || wildcardMatch(s.substring(1), p)
  else
    (p(0) == '?' || p(0) == s(0)) &&
      wildcardMatch(s.substring(1), p.substring(1))
}
// wildcardMatch("aabbbcddddd", "a*bcdd?dd") // true
// wildcardMatch("aabbbcddddd", "a*cdd?d") // false

As with any recursive code, we first come up with the “exit” strategy, which in this case is that when the parsing of the input string s is exhausted, the parsing of the wildcard pattern p must either also be exhausted or be left with only * characters.

As for the core recursion logic that evaluates the pattern matching character *:

wildcardMatch(s, p.substring(1)) || wildcardMatch(s.substring(1), p)

if the current character of the pattern string p is *, the algorithm recursively checks two alternatives: either it consumes the wildcard character to match zero characters of the input string s, or it matches the current character of s and keeps the wildcard character to evaluate the next character of s.

As for the evaluation of pattern matching character ?:

(p(0) == '?' || p(0) == s(0)) &&
  wildcardMatch(s.substring(1), p.substring(1))

if the current character of the pattern string p is ?, it matches the current character of s unconditionally; otherwise the current characters of p and s must be equal. In either case, the remaining substrings of p and s must still match.
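A few more sample calls illustrate the exit strategy and the two special characters:

// wildcardMatch("", "***") // true -- s exhausted, remaining p all '*'
// wildcardMatch("", "*a") // false
// wildcardMatch("abc", "a?c") // true
// wildcardMatch("abc", "a?") // false -- p exhausted before s
// wildcardMatch("abc", "*") // true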

Wildcard pattern matching via dynamic programming

Despite the vast difference in approach between recursion and dynamic programming, it’s interesting how the programming logic formulated in the recursive version can be “borrowed” for the dynamic programming version. If one looks closely at the following dynamic programming implementation, its logic closely resembles the recursive version’s.

// Wildcard pattern matching implementation via dynamic programming in Scala
def wildcardMatchDP(s: String, p: String): Boolean = {
  val n = s.length
  val m = p.length
  val dp: Array[Array[Boolean]] = Array.fill(n+1, m+1)(false)
  dp(0)(0) = true
  for (j <- 1 to m) {
    if (p(j-1) == '*')
      dp(0)(j) = dp(0)(j-1)
  }
  for (i <- 1 to n) {
    for (j <- 1 to m) {
      if (p(j-1) == '*')
        dp(i)(j) = dp(i)(j-1) || dp(i-1)(j)
      else
        dp(i)(j) = (p(j-1) == '?' || p(j-1) == s(i-1)) && dp(i-1)(j-1)
    }
  }
  dp(n)(m)
}
// wildcardMatchDP("aabbbcddddd", "a*bcdd?dd") // true
// wildcardMatchDP("aabbbcddddd", "a*cdd?d") // false

To reuse the programming logic from the recursive version for dynamic programming, we maintain a 2-dimensional array dp (with dimension (n+1) x (m+1), where n is the length of s and m is the length of p) and have dp(i)(j) represent whether the first i characters of the input string s match the first j characters of the pattern string p. For instance, dp(0)(1) would represent whether we have a match between an empty s and the first character of p.

The exit strategy in the recursive code that says “when the iterative parsing of s finishes, either the parsing of p is also finished or the remaining characters in p must all be *” now becomes the initialization of array dp:

dp(0)(0) = true
for (j <- 1 to m) {
  if (p(j - 1) == '*')
    dp(0)(j) = dp(0)(j - 1)
}

For the core programming logic, we iteratively build, bottom up, from the initialized array, computing the dp(i)(j) values with the same reasoning about how the * and ? wildcard matching works:

for (i <- 1 to n) {
  for (j <- 1 to m) {
    if (p(j-1) == '*')
      dp(i)(j) = dp(i)(j-1) || dp(i-1)(j)
    else
      dp(i)(j) = (p(j-1) == '?' || p(j-1) == s(i-1)) && dp(i-1)(j-1)
  }
}
dp(n)(m)
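For a concrete feel of how the table fills, take a small example, say s = “ab” and p = “*b”:

// dp for wildcardMatchDP("ab", "*b"); rows are prefixes of s, columns are prefixes of p
//              ""      "*"     "*b"
//   ""        true    true    false
//   "a"       false   true    false
//   "ab"      false   true    true    <- dp(2)(2), the final answer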

It should be noted that one could come up with similar programming logic for the dynamic programming implementation without referencing the recursive version at all. The above exercise is just an interesting showcase that two seemingly disparate programming methods can do their job using the same underlying logic.

Performance-wise, the dynamic programming version, with a time complexity of O(n*m), is obviously significantly better than the recursive version, which is exponential (on the order of O((n+m) * 2^(n+m))).
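Alternatively, the same recursive logic can be sped up top-down via memoization, caching results keyed on the remaining lengths of s and p (which uniquely identify the suffixes being compared). Below is a minimal sketch along those lines; the name wildcardMatchMemo is just illustrative and not part of the original code:

// Top-down memoized wildcard matching (illustrative sketch)
def wildcardMatchMemo(s: String, p: String): Boolean = {
  // cache keyed by the remaining lengths of s and p
  val cache = scala.collection.mutable.Map.empty[(Int, Int), Boolean]
  def go(s: String, p: String): Boolean =
    cache.get((s.length, p.length)) match {
      case Some(cached) => cached
      case None =>
        val res =
          if (s.isEmpty) p.forall(_ == '*')
          else if (p.isEmpty) false
          else if (p(0) == '*') go(s, p.substring(1)) || go(s.substring(1), p)
          else (p(0) == '?' || p(0) == s(0)) && go(s.substring(1), p.substring(1))
        cache((s.length, p.length)) = res
        res
    }
  go(s, p)
}
// wildcardMatchMemo("aabbbcddddd", "a*bcdd?dd") // true

With at most (n+1) x (m+1) distinct (length, length) states, the memoized version avoids the exponential blow-up of the plain recursion.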

Regex pattern matching via recursion

Regex is a comprehensive pattern matching toolset equipped with a suite of versatile features. We’ll be implementing only two of the most commonly used patterns, . and *, with . matching any single character and * matching zero or more contiguous occurrences of the preceding character. In case the character preceding * is ., the pair will match zero or more characters of any kind.

For example, regex pattern “a*b.c” would match “aaabzc” and “abbc”, but not “axybzc”. And “a.*b” would match “axyzb” and “ab”.

// Regex pattern matching implementation via recursion in Scala
def regexMatch(s: String, p: String): Boolean = {
  // guard with s.nonEmpty so that an exhausted s safely evaluates to false
  def matchOne: Boolean = s.nonEmpty && (p(0) == '.' || p(0) == s(0))
  if (p.isEmpty)
    s.isEmpty
  else if (p.length > 1 && p(1) == '*')
    regexMatch(s, p.substring(2)) || (matchOne && regexMatch(s.substring(1), p))
  else
    matchOne && regexMatch(s.substring(1), p.substring(1))
}
// regexMatch("aabbbcddddd", "a.*bcdd.dd") // true
// regexMatch("aabbbcddddd", "a*bcdd.dd") // false
// regexMatch("aabbbcddddd", "a.*bcd*") // true

Similar to the recursive wildcard pattern matching implementation, we come up with the exit strategy first: when the parsing of the regex pattern p is exhausted, the parsing of the input string s must be exhausted as well.

Like wildcard’s ?, regex’s . matches any single character in exactly the same way. Since matching a single character will be used more than once (as can be seen shortly), we create a simple method matchOne, guarded with s.nonEmpty so that it safely evaluates to false once s is exhausted:

def matchOne: Boolean = s.nonEmpty && (p(0) == '.' || p(0) == s(0))

Unlike the wildcard implementation, whose base case keys on the input string s, here we key the recursion on the pattern p, since regex’s * always works together with its preceding character; hence the need to single out the case of p.length > 1.

if (p.length > 1 && p(1) == '*')
  regexMatch(s, p.substring(2)) || (matchOne && regexMatch(s.substring(1), p))
else
  matchOne && regexMatch(s.substring(1), p.substring(1))

Here, the main recursive logic is similar to that of the wildcard implementation, except that the character preceding * is now an integral element for the * pattern matching.
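A few more sample calls, for instance:

// regexMatch("", "a*") // true -- "a*" matches zero occurrences of 'a'
// regexMatch("aaa", "a*") // true
// regexMatch("abc", "a.c") // true
// regexMatch("ab", "a") // false -- s not exhausted when p is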

Regex pattern matching via dynamic programming

Next, we come up with the dynamic programming version by “borrowing” the programming logic, again, from the recursive version:

// Regex pattern matching implementation using dynamic programming in Scala
// (assumes a well-formed pattern, i.e. '*' never appears as the first character)
def regexMatchDP(s: String, p: String): Boolean = {
  val n = s.length
  val m = p.length
  val dp: Array[Array[Boolean]] = Array.fill(n+1, m+1)(false)
  dp(0)(0) = true
  for (j <- 2 to m) {
    // a pattern prefix made solely of "x*" groups matches an empty s
    if (p(j-1) == '*')
      dp(0)(j) = dp(0)(j-2)
  }
  for (i <- 1 to n) {
    for (j <- 1 to m) {
      val matchOne = p(j-1) == '.' || p(j-1) == s(i-1)
      if (p(j-1) == '*')
        // "x*" matches zero occurrences of the preceding character, or one more occurrence
        dp(i)(j) = dp(i)(j-2) ||
          ((p(j-2) == '.' || p(j-2) == s(i-1)) && dp(i-1)(j))
      else
        dp(i)(j) = matchOne && dp(i-1)(j-1)
    }
  }
  dp(n)(m)
}
// regexMatchDP("aabbbcddddd", "a.*bcdd.dd") // true
// regexMatchDP("aabbbcddddd", "a*bcdd.dd") // false
// regexMatchDP("aabbbcddddd", "a.*bcd*") // true

Again, we maintain a 2-dimensional array dp (with dimension (n+1) x (m+1), where n is the length of s and m is the length of p) and have dp(i)(j) represent whether the first i characters of the input string s match the first j characters of the pattern string p as a regex.

The exit strategy in the recursive code becomes the initialization of array dp. Besides dp(0)(0) = true for matching an empty s against an empty p, any pattern prefix consisting solely of “x*” groups (e.g. “a*” or “a*b*”) also matches an empty s, since each group can match zero occurrences:

dp(0)(0) = true
for (j <- 2 to m) {
  if (p(j-1) == '*')
    dp(0)(j) = dp(0)(j-2)
}

And, similar to the wildcard implementation, the core programming logic that resembles the recursive version’s is carried out by iteratively computing dp(i)(j) for incremental indices starting from the initialized row:

for (i <- 1 to n) {
  for (j <- 1 to m) {
    val matchOne = p(j-1) == '.' || p(j-1) == s(i-1)
    if (p(j-1) == '*')
      dp(i)(j) = dp(i)(j-2) ||
        ((p(j-2) == '.' || p(j-2) == s(i-1)) && dp(i-1)(j))
    else
      dp(i)(j) = matchOne && dp(i-1)(j-1)
  }
}
dp(n)(m)

Note that the dp(0)(j) initialization is a piece of logic that has to be spelled out explicitly in the dynamic programming version, whereas the recursive version gets it for free: once s is exhausted, the recursion simply keeps dropping “x*” groups via regexMatch(s, p.substring(2)) until p is exhausted as well. In the main loop, a * column always works together with its preceding pattern character p(j-2): dp(i)(j-2) covers matching zero occurrences of that character, while dp(i-1)(j), guarded by the character comparison, covers matching one more occurrence.
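For instance, the initialization is what lets patterns composed entirely of “x*” groups match an empty input:

// regexMatchDP("", "a*b*") // true -- both groups match zero occurrences
// regexMatchDP("aab", "c*a*b") // true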

Fibonacci In Scala: Tailrec, Memoized

One of the most popular number series being used as a programming exercise is undoubtedly the Fibonacci numbers:

F(0) = 0
F(1) = 1
F(n) = F(n-1) + F(n-2)

Perhaps a prominent reason why the Fibonacci sequence is of vast interest in Math is the associated Golden Ratio, but I think what makes it a great programming exercise is that despite a simplistic definition, the sequence’s exponential growth rate presents challenges in implementations with space/time efficiency in mind.

Having seen various ways of implementing methods for the Fibonacci numbers, I thought it might be worth putting them together, from a naive implementation to something more space/time efficient. But first, let’s take a quick look at the computational complexity of Fibonacci.

Fibonacci complexity

If we denote T(n) as the time required to compute F(n), by definition:

T(n) = T(n-1) + T(n-2) + K

where K is the time taken by some simple arithmetic to arrive at F(n) from F(n-1) and F(n-2).

With some approximate math analysis (see this post), it can be shown that T(n) grows at least on the order of 2^(n/2) and at most on the order of 2^n. For better precision, one can derive a tighter bound by solving the associated characteristic equation `x^2 = x + 1`, whose positive root (by the quadratic formula) is x = (1 + sqrt(5))/2 ≈ 1.618, to deduce that:

Time complexity for computing F(n) = O(R^n)

where R ≈ 1.618 is the Golden Ratio.

As for space complexity, if one looks at the recursion tree for computing F(n), it’s pretty clear that its depth is F(n-1)’s tree depth plus one, so the maximum depth grows linearly with n. Since the recursion explores one branch at a time, the space needed at any moment is proportional to the depth of the call stack, and thus the required space for F(n) is proportional to n. In other words:

Space complexity for computing F(n) = O(n)

The relatively small space complexity, compared with the exponential time complexity, explains why computing a Fibonacci number too large for a computer would generally lead to a seemingly endless run rather than an out-of-memory or stack-overflow problem.

It’s worth noting, though, that if F(n) is computed via conventional iterations (e.g. a while-loop, or tail recursion which Scala translates into iterations under the hood), the time complexity is reduced to O(n), proportional to the number of loop cycles. The space complexity becomes O(1), since only the two most recently computed values need to be kept at any point.
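For illustration, a minimal while-loop sketch along those lines (fibIter is just an illustrative name) keeps only the two most recent values:

// Iterative Fibonacci: O(n) time, O(1) space
def fibIter(n: Int): BigInt = {
  var (a, b) = (BigInt(0), BigInt(1))  // invariant: a = F(i), b = F(i+1)
  var i = 0
  while (i < n) {
    val next = a + b
    a = b
    b = next
    i += 1
  }
  a
}
// fibIter(90) // 2880067194370816120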

Naive Fibonacci

To generate Fibonacci numbers, the most straightforward approach is a basic recursive function like the one below:

def fib(n: Int): BigInt = n match {
  case 0 => 0
  case 1 => 1
  case _ => fib(n-2) + fib(n-1)
}

(0 to 10).foreach(n => print(fib(n) + " "))
// 0 1 1 2 3 5 8 13 21 34 55

fib(50)
// res1: BigInt = 12586269025

With such a `naive` recursive function, computing the 50th number, i.e. fib(50), would take minutes on a typical laptop, and attempting to compute a number much higher up, like fib(90), would for all practical purposes run forever.

Tail recursive Fibonacci

So, let’s come up with a tail recursive method:

def fibTR(num: Int): BigInt = {
  @scala.annotation.tailrec
  def fibFcn(n: Int, acc1: BigInt, acc2: BigInt): BigInt = n match {
    case 0 => acc1
    case 1 => acc2
    case _ => fibFcn(n - 1, acc2, acc1 + acc2)
  }

  fibFcn(num, 0, 1)
}

As shown above, tail recursion is accomplished by means of a couple of accumulators as parameters for the inner method to recursively carry over the two numbers that precede the current number.
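For instance, tracing fibTR(4) shows the two accumulators carrying the pair of preceding numbers along:

// fibTR(4)
// fibFcn(4, 0, 1) -> fibFcn(3, 1, 1) -> fibFcn(2, 1, 2) -> fibFcn(1, 2, 3) -> 3  (= F(4))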

With the Fibonacci `TailRec` version, computing, say, the 90th number would finish instantaneously.

fibTR(90)
// res2: BigInt = 2880067194370816120

Fibonacci in a Scala Stream

Another way of implementing Fibonacci is to define the sequence to be stored in a “lazy” collection, such as a Scala Stream:

val fibS: Stream[BigInt] = 0 #:: fibS.scan(BigInt(1))(_ + _)

fibS(90)
// res3: BigInt = 2880067194370816120

Using method scan, `fibS.scan(BigInt(1))(_ + _)` generates a Stream of running sums of `fibS` seeded with 1. Thanks to the identity F(0) + F(1) + … + F(k) = F(k+2) – 1, each running sum lands exactly on the next Fibonacci number, so prepending 0 yields the full sequence. Since Streams are “lazy”, none of the element values in the defined `fibS` will be evaluated until the element is requested.
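A quick check of the first few elements:

// fibS.take(8).toList
// List(0, 1, 1, 2, 3, 5, 8, 13)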

While at it, there are a couple of other commonly seen Fibonacci implementation variants using Scala Stream:

val fibS: Stream[BigInt] = 0 #:: 1 #:: (fibS zip fibS.tail).map(n => n._1 + n._2)

val fibS: Stream[BigInt] = {
  def fs(prev: BigInt, curr: BigInt): Stream[BigInt] = prev #:: fs(curr, prev + curr)
  fs(0, 1)
}

Scala Stream memoizes by design

These Stream-based Fibonacci implementations perform reasonably well, somewhat comparable to the tail recursive Fibonacci. But while these Stream implementations all involve recursion, none of them is tail recursive. So, why don’t they suffer the same performance issue as the `naive` Fibonacci implementation? The short answer is memoization.

Digging into the source code of Scala Stream would reveal that method `#::` (which is wrapped in class ConsWrapper) is defined as:

def #::[B >: A](hd: B): Stream[B] = cons(hd, tl) 

Tracing method `cons` further reveals that the Stream tail is a by-name parameter to class `Cons`, thus ensuring that the concatenation is performed lazily:

final class Cons[+A](hd: A, tl: => Stream[A]) extends Stream[A]

But lazy evaluation via a by-name parameter alone does not provide memoization. Digging deeper into the source code, one would see that Stream content is iterated through a StreamIterator class defined as follows:

final class StreamIterator[+A] private() extends AbstractIterator[A] with Iterator[A] {
  def this(self: Stream[A]) {
    this()
    these = new LazyCell(self)
  }

  class LazyCell(st: => Stream[A]) {
    lazy val v = st
  }

  private var these: LazyCell = _

  def hasNext: Boolean = these.v.nonEmpty

  def next(): A =
    if (isEmpty) Iterator.empty.next()
    else {
      val cur    = these.v
      val result = cur.head
      these = new LazyCell(cur.tail)
      result
    }

  ...
}

The inner class `LazyCell` not only takes a by-name parameter but, more importantly, holds the underlying Stream in a `lazy val`, which by nature enables memoization by caching the value upon the first (and only the first) evaluation.
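The effect of a `lazy val` is easy to see in isolation; a minimal illustration:

// a lazy val's right-hand side is evaluated once and the result cached
lazy val answer: Int = { println("evaluating"); 42 }
// answer // prints "evaluating", then returns 42
// answer // returns 42 without re-evaluating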

Memoized Fibonacci using a mutable Map

While using a Scala Stream to implement Fibonacci would automatically leverage memoization, one could also explicitly employ the very feature without Streams. For instance, by leveraging method getOrElseUpdate in a mutable Map, a `memoize` function can be defined as follows:

// Memoization using mutable Map

def memoize[K, V](f: K => V): K => V = {
  val cache = scala.collection.mutable.Map.empty[K, V]
  k => cache.getOrElseUpdate(k, f(k))
}

For example, the `naive` Fibonacci equipped with memoization via this `memoize` function would instantly become a much more efficient implementation:

val fibM: Int => BigInt = memoize(n => n match {
  case 0 => 0
  case 1 => 1
  case _ => fibM(n-2) + fibM(n-1)
})

fibM(90)
// res4: BigInt = 2880067194370816120

For the tail recursive Fibonacci `fibTR`, this `memoize` function wouldn’t be applicable as its inner function `fibFcn` takes accumulators as additional parameters. As for the Stream-based `fibS` which is already equipped with Stream’s memoization, applying `memoize` wouldn’t produce any significant performance gain.
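The same `memoize` helper works for any pure single-argument function. For instance, with a hypothetical expensive computation (squareM below is just an illustrative example, not from the original post), each distinct argument is computed only once:

val squareM: Int => Long = memoize { n =>
  println(s"computing $n")  // side effect just to show when the function body runs
  n.toLong * n
}
// squareM(12) // prints "computing 12", returns 144
// squareM(12) // returns 144 from the cache, nothing printed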