
Spark – Time Series Sessions

When analyzing time series activity data, it’s often useful to group the chronological activities into “target”-based sessions. These “targets” could be products, events, web pages, etc.

In this blog post, we'll use a simplified time series log of web page activities to look at how page-based sessions can be created.

Let’s say we have a log of chronological web page activities, with each row consisting of a user, the web page visited, and a timestamp.

And let’s say we want to group the log data by web page to generate per-user sessions with IDs in the format userID-#, where # is a monotonically increasing integer.

The first idea that comes to mind might be to perform a groupBy(user, page) or a Window partitionBy(user, page). But that wouldn’t work, since it disregards the time gaps between visits to the same page: for a given user, all rows with the same page would end up grouped together.
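
For concreteness, here's a minimal sketch of that naive grouping, written against the sample DataFrame df assembled in the next step (the aggregation columns are just for illustration):

```scala
import org.apache.spark.sql.functions._

// Naive attempt: grouping by (user, page) collapses every visit a user
// makes to the same page into a single group, regardless of how much
// time passes between visits.
val naive = df.groupBy("user", "page")
  .agg(count("*").as("hits"), min("timestamp").as("first_seen"))
```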

First things first, let’s assemble a DataFrame with some sample web page activity data:
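
The original sample data isn’t reproduced here, so the snippet below builds a hypothetical DataFrame with made-up users, pages, and timestamps; only the column names user, page, and timestamp matter for the steps that follow.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("PageSessions").getOrCreate()
import spark.implicits._

// Hypothetical sample activity log (made-up values for illustration).
val df = Seq(
  ("101", "home",    Timestamp.valueOf("2019-03-01 10:00:00")),
  ("101", "home",    Timestamp.valueOf("2019-03-01 10:01:30")),
  ("101", "product", Timestamp.valueOf("2019-03-01 10:03:00")),
  ("101", "home",    Timestamp.valueOf("2019-03-01 10:10:00")),
  ("101", "home",    Timestamp.valueOf("2019-03-01 10:11:00")),
  ("102", "product", Timestamp.valueOf("2019-03-01 11:00:00")),
  ("102", "product", Timestamp.valueOf("2019-03-01 11:02:00")),
  ("102", "home",    Timestamp.valueOf("2019-03-01 11:05:00")),
  ("102", "product", Timestamp.valueOf("2019-03-01 11:20:00"))
).toDF("user", "page", "timestamp")

df.show(false)
```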

The solution to be presented here involves a few steps:

  1. Generate a new column first_ts which, for each user, has the value of timestamp in the current row if the page value is different from that in the previous row; otherwise null.
  2. Backfill all the nulls in first_ts with the last non-null value via the Window function last(), storing the result in a new column sess_ts.
  3. Assemble session IDs by concatenating user and the dense_rank of sess_ts within each user partition.

Note that the final output keeps all the intermediate columns (i.e. first_ts and sess_ts) for demonstration purposes.
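
Below is a sketch of the three steps put together. The column names first_ts and sess_ts follow the steps above; the session-ID column name (sess_id) and the exact expressions are my reading of the described approach, not the post's original code.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Window over each user's rows in chronological order.
val byUser = Window.partitionBy("user").orderBy("timestamp")
// Same window, bounded so last() only looks at rows up to the current one.
val fillDown = byUser.rowsBetween(Window.unboundedPreceding, Window.currentRow)
// Window used to rank the distinct session start timestamps per user.
val bySessTs = Window.partitionBy("user").orderBy("sess_ts")

val sessions = df
  // Step 1: keep the timestamp only when the page differs from the previous
  // row's page for the same user (or when there is no previous row).
  .withColumn("first_ts",
    when(lag("page", 1).over(byUser).isNull ||
         lag("page", 1).over(byUser) =!= col("page"), col("timestamp")))
  // Step 2: backfill the nulls with the last non-null first_ts, i.e. the
  // timestamp at which the current session started.
  .withColumn("sess_ts", last("first_ts", ignoreNulls = true).over(fillDown))
  // Step 3: session ID = user + "-" + dense_rank of sess_ts within the user.
  .withColumn("sess_id", concat_ws("-", col("user"), dense_rank().over(bySessTs)))

sessions.orderBy("user", "timestamp").show(false)
```

Ordering the final show() by user and timestamp makes it easy to verify the behavior: a new sess_id starts whenever the page changes, and returning to a previously visited page after viewing a different one starts a new session rather than reusing the old one.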