Spark – Schema With Nested Columns

Extracting columns based on certain criteria from a DataFrame (or Dataset) with a flat schema of only top-level columns is simple. It gets slightly less trivial, though, if the schema consists of hierarchically nested columns.

Recursive traversal

In functional programming, a common tactic for traversing arbitrarily nested collections of elements is recursion, which is generally preferred over while-loops with mutable counters. For performance at scale, making the traversal tail-recursive may be necessary, although that is less of a concern in this case, given that a DataFrame typically consists of no more than a few hundred columns and a few levels of nesting.

We’re going to illustrate in a couple of simple examples how recursion can be used to effectively process a DataFrame with a schema of nested columns.

Example #1:  Get all nested columns of a given data type

Consider the following snippet:
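A minimal Scala sketch of such a method follows; the name getNestedColumns, its exact signature, and the dot-delimited name prefix are illustrative assumptions rather than a definitive implementation:

```scala
import org.apache.spark.sql.types.{DataType, StructType}

// Recursively collect the fully qualified (dot-delimited) names of all
// nested columns whose data type matches `dtype`.
def getNestedColumns(schema: StructType, dtype: DataType, prefix: String = ""): Seq[String] =
  schema.fields.toSeq.flatMap { field =>
    field.dataType match {
      case structType: StructType =>
        // Struct column: recurse into its children, extending the prefix
        getNestedColumns(structType, dtype, s"$prefix${field.name}.")
      case dt if dt == dtype =>
        // Matching leaf column: emit its fully qualified name
        Seq(s"$prefix${field.name}")
      case _ =>
        Seq.empty
    }
  }
```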

The simple recursive method inspects the data type of each column in the DataFrame and, upon encountering a StructType, recurses to traverse its child columns. Along the way, a string prefix is assembled to express the hierarchy of the individual nested columns and is prepended to each column with the matching data type.

Testing the method:
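The sample DataFrame below, built from made-up Person and Address case classes, is one plausible test fixture (runnable in a spark-shell, where the implicits needed for toDF are already in scope):

```scala
import org.apache.spark.sql.types.{DoubleType, StringType}

case class Address(city: String, state: String)
case class Person(name: String, address: Address, score: Double)

// A small DataFrame with one level of struct nesting
val df = Seq(
  Person("Alice", Address("Anchorage", "AK"), 10.0),
  Person("Brad", Address("Boston", "MA"), 20.0)
).toDF()

getNestedColumns(df.schema, StringType)
// => Seq(name, address.city, address.state)

getNestedColumns(df.schema, DoubleType)
// => Seq(score)
```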

Example #2:  Rename all nested columns via a provided function

In this example, we’re going to rename columns in a DataFrame with a nested schema based on a provided rename function. The required logic for recursively traversing the nested columns is pretty much the same as in the previous example.
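A sketch along those lines (the name renameAllColumns is an assumption; like the previous sketch, it recurses into StructType children only and does not descend into arrays or maps of structs):

```scala
import org.apache.spark.sql.types.StructType

// Recursively rebuild a schema, applying `rename` to every column,
// top-level and nested, while preserving the overall structure.
def renameAllColumns(schema: StructType, rename: String => String): StructType =
  StructType(schema.fields.map { field =>
    field.dataType match {
      case structType: StructType =>
        // Rename the struct column itself, then recurse into its children
        field.copy(name = rename(field.name), dataType = renameAllColumns(structType, rename))
      case _ =>
        field.copy(name = rename(field.name))
    }
  })
```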

Testing the method (with the same DataFrame used in the previous example):
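One common way to apply the rebuilt schema, used here as an assumption rather than necessarily the original approach, is to re-create the DataFrame from the underlying RDD:

```scala
// Uppercase every column name, nested ones included
val upperCaseSchema = renameAllColumns(df.schema, _.toUpperCase)
val dfRenamed = spark.createDataFrame(df.rdd, upperCaseSchema)

dfRenamed.printSchema()
// root
//  |-- NAME: string (nullable = true)
//  |-- ADDRESS: struct (nullable = true)
//  |    |-- CITY: string (nullable = true)
//  |    |-- STATE: string (nullable = true)
//  |-- SCORE: double (nullable = false)
```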

In case it isn’t obvious: in traversing a given StructType’s child columns, we use map (as opposed to flatMap in the previous example) in order to preserve the hierarchical column structure.