String Pattern Matching Implementation

Regardless of the programming platform one is on, string pattern matching is often needed in various use cases, particularly as a filtering condition for data retrieval. Regex, short for regular expression, has been one of the most common and versatile string pattern matching methods. Then, there is also the more simplistic wildcard string pattern style.

In this blog post, we’re going to implement in Scala the wildcard pattern matching as well as a minified version of regex pattern matching. On top of that, we’ll also come up with their dynamic programming version by means of memoization for better performance.

Recursion will be the first approach to implementing the string pattern matching methods. Why? That’s because algorithmically it’s probably the crispest way to reason about the implementation logic. There is also the benefit that in many cases the more performant dynamic programming variants can be formulated based on the recursion logic.

Wildcard pattern matching via recursion

Wildcard pattern matching consists of two special characters ? and *, with ? matching any single character and * matching zero or more characters.

For example, wildcard pattern “a*b?c” would match “axybzc” and “abbc”, but not “axyzbc” or “aczc”.

// Wildcard pattern matching implementation via recursion in Scala
def wildcardMatch(s: String, p: String): Boolean = {
  if (s.isEmpty)
    p.isEmpty || p.forall(_ == '*')
  else {
    if (p.isEmpty)
      false
    else {
      if (p(0) == '*')
        wildcardMatch(s, p.substring(1)) || wildcardMatch(s.substring(1), p)
      else
        (p(0) == '?' || p(0) == s(0)) &&
          wildcardMatch(s.substring(1), p.substring(1))
    }
  }
}
// wildcardMatch("aabbbcddddd", "a*bcdd?dd") // true
// wildcardMatch("aabbbcddddd", "a*cdd?d") // false

As with any recursive code, we first come up with the “exit” strategy which in this case is that when the iterative parsing of the input string s is exhausted, the iterative interpreting of the wildcare pattern p must be either also exhausted or left with only * characters.

On the core recursion logic that evaluates pattern matching character *:

wildcardMatch(s, p.substring(1)) || wildcardMatch(s.substring(1), p)

if the current character of the pattern string p being iteratively parsed is * the algorithm will make sure — in a recursive manner — either it consumes the wildcare character to match zero character from the current character of the input string s, or it matches the current character of s and continues to use the wildcare character to evaluate for the next character of s.

As for the evaluation of pattern matching character ?:

(p(0) == '?' || p(0) == s(0)) &&
  wildcardMatch(s.substring(1), p.substring(1))

if the current character of the pattern string p being parsed is ?, it is no different from evaluating whether the current character of p is the same as that of s and the remaining substrings of p and s still match.

Wildcard pattern matching via dynamic programming

Despite the vast difference in approach between recursion and dynamic programming, it’s interesting that how the programming logic being formulated in the recursive version can be “borrowed” for the dynamic programming version. If one looks closely at the following dynamic programming implementation, its logic resembles a great deal of the recursive version’s.

// Wildcard pattern matching implementation via dynamic programming in Scala
def wildcardMatchDP(s: String, p: String): Boolean = {
  val n = s.length
  val m = p.length
  val dp: Array[Array[Boolean]] = Array.fill(n+1, m+1)(false)
  dp(0)(0) = true
  for (j <- 1 to m) {
    if (p(j-1) == '*')
      dp(0)(j) = dp(0)(j-1)
  }
  for (i <- 1 to n) {
    for (j <- 1 to m) {
      if (p(j-1) == '*')
        dp(i)(j) = dp(i)(j-1) || dp(i-1)(j)
      else
        dp(i)(j) = (p(j-1) == '?' || p(j-1) == s(i-1)) && dp(i-1)(j-1)
    }
  }
  dp(n)(m)
}
// wildcardMatchDP("aabbbcddddd", "a*bcdd?dd") // true
// wildcardMatchDP("aabbbcddddd", "a*cdd?d") // false

To use the programming logic from the recursive version for dynamic programing, we maintain a 2-dimensional array dp (with dimension n+1 x m+1, where n is length of s and m is length of p) and have dp(i)(j) represent whether we have a wildcard match with only the i right-most characters in input string s and j right-most characters in pattern string p. For instance, dp(0)(1) would represent whether we have a match with an empty s and a p with only its right-most character.

The exit strategy in the recursive code that says “when the iterative parsing of s finishes, either the parsing of p is also finished or the remaining characters in p must all be *” now becomes the initialization of array dp:

dp(0)(0) = true
for (j <- 1 to m) {
  if (p(j - 1) == '*')
    dp(0)(j) = dp(0)(j - 1)
}

For the core programming logic implementation, we iteratively build, bottom up, from the initialized array by computing dp(i)(j) values with the same reasoning of how the * and ? wildcard matching works:

for (i <- 1 to n) {
  for (j <- 1 to m) {
    if (p(j-1) == '*')
      dp(i)(j) = dp(i)(j-1) || dp(i-1)(j)
    else
      dp(i)(j) = (p(j-1) == '?' || p(j-1) == s(i-1)) && dp(i-1)(j-1)
  }
}
dp(n)(m)

It should be noted that one could come up similar programming logic for the dynamic programming implementation without referencing the recursive version at all. The above exercise is just an interesting showcase that two seemingly disparate programming methods could do their job using the same logic.

Performance wise, obviously the dynamic programming version with a time complexity of O(nm) is significantly better than the recursive version which is at the scale of O((n+m)2^(n+m)).

Regex pattern matching via recursion

Regex is a comprehensive pattern matching toolset equipped with a suite of versatile methods. We’ll be implementing only two of the most commonly used patterns . and *, with . matching any single character and * matching zero or more contiguous occurrences of the preceding character. In case the character preceding * is ., it’ll match zero or any combinations of characters.

For example, regex pattern “ab.c” would match “aaabzc” and “abbc”, but not “axybzc”. And “a.b” would match “axyzb” and “ab”.

// Regex pattern matching implementation via recursion in Scala
def regexMatch(s: String, p: String): Boolean = {
  def matchOne: Boolean = p(0) == '.' || p(0) == s(0)
  if (p.isEmpty)
    s.isEmpty
  else {
    if (p.length > 1 && p(1) == '*')
      regexMatch(s, p.substring(2)) || (matchOne && regexMatch(s.substring(1), p))
    else
      matchOne && regexMatch(s.substring(1), p.substring(1))
  }
}
// regexMatch("aabbbcddddd", "a.*bcdd.dd") // true
// regexMatch("aabbbcddddd", "a*bcdd.dd") // false
// regexMatch("aabbbcddddd", "a.*bcd*") // true

Similar to the recursive wildcard pattern matching implementation, we come up with the exist strategy that the iterative parsing of the input string s and the iteratively interpreting of the regex pattern p must be exhausted simultaneously.

Like wildcard’s ?, regex’s . has exactly the same matching pattern. Since pattern matching a single character will be used more than once (as can be seen shortly), we create a simple method matchOne:

def matchOne: Boolean = p(0) == '.' || p(0) == s(0)

Contrary to iterating s in the main recursive loop in the wildcard matching implementation, we iterate p for the convenience of evaluating regex’s * that always works with its preceding character — hence the need for singling out the case of p.length > 1.

if (p.length > 1 && p(1) == '*')
  regexMatch(s, p.substring(2)) || (matchOne && regexMatch(s.substring(1), p))
else
  matchOne && regexMatch(s.substring(1), p.substring(1))

Here, the main recursive logic is similar to that of the wildcard implementation, except that the character preceding * is now an integral element for the * pattern matching.

Regex pattern matching via dynamic programming

Next, we come up with the dynamic programming version by “borrowing” the programming logic, again, from the recursive version:

// Regex pattern matching implementation using dynamic programming in Scala
def regexMatchDP(s: String, p: String): Boolean = {
  val n = s.length
  val m = p.length
  val dp: Array[Array[Boolean]] = Array.fill(n+1, m+1)(false)
  dp(0)(0) = true
  for (i <- 1 to n) {
    for (j <- 1 to m) {
      val matchOne = p(j-1) == '.' || p(j-1) == s(i-1)
      if (j > 1 && p(j-2) == '*')
        dp(i)(j) = dp(i)(j-2) || (matchOne && dp(i-1)(j))
      else {
        if (p(j-1) == '*')
          dp(i)(j) = dp(i)(j-1) || dp(i-1)(j)
        else
          dp(i)(j) = matchOne && dp(i-1)(j-1)
      }
    }
  }
  dp(n)(m)
}
// regexMatchDP("aabbbcddddd", "a.*bcdd.dd") // true
// regexMatchDP("aabbbcddddd", "a*bcdd.dd") // false
// regexMatchDP("aabbbcddddd", "a.*bcd*") // true

Again, we maintain a 2-dimensional array dp (with dimension n+1 x m+1, where n is length of s and m is length of p) and have dp(i)(j) represent whether we have a regex match with only the i right-most characters in input string s and j right-most characters in pattern string p.

The exit strategy in the recursive code becomes the simple initialization of array dp:

dp(0)(0) = true

And, similar to the wildward implementation, the core programming logic that resembles the recursive version’s is carried out by iteratively computing dp(i)(j) of incremental indices starting from dp(0)(0):

for (i <- 1 to n) {
  for (j <- 1 to m) {
    val matchOne = p(j-1) == '.' || p(j-1) == s(i-1)
    if (j > 1 && p(j-2) == '*')
      dp(i)(j) = dp(i)(j-2) || (matchOne && dp(i-1)(j))
    else {
      if (p(j-1) == '*')  // <--- additional logic needed for DP
        dp(i)(j) = dp(i)(j-1) || dp(i-1)(j)
      else
        dp(i)(j) = matchOne && dp(i-1)(j-1)
    }
  }
}
dp(n)(m)

As noted in the extracted snippet though, there is a conditional logic that needs to be explicitly included in the dynamic programming version to handle the case of having * as pattern string p‘s last character. Since j stops at m, the p(j-2) == '*' condition will never get evaluated if the last pattern string character p(m-1) is *.

Leave a Reply

Your email address will not be published. Required fields are marked *