
Spark – Time Series Sessions

When analyzing time series activity data, it’s often useful to group the chronological activities into “target”-based sessions. These “targets” could be products, events, web pages, etc.

In this blog post, we'll use a simplified time series log of web page activities to look at how page-based sessions can be created.

Let’s say we have a log of chronological web page activities, with each row consisting of a user, the web page visited, and a timestamp.

And let’s say we want to group the log data by web page to generate per-user sessions with IDs in the format userID-#, where # is a monotonically increasing integer.

The first idea that comes to mind might be to perform a groupBy(user, page) or a Window partitionBy(user, page). But that wouldn’t work, since it disregards the time gaps between visits to the same page: for a given user, all rows with the same page would end up grouped together.
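
For concreteness, here's a minimal sketch of that naive grouping, written against the sample DataFrame df assembled in the next step (the aggregation columns are just for illustration):

```scala
import org.apache.spark.sql.functions._

// Naive attempt: grouping by (user, page) collapses every visit a user
// makes to the same page into a single group, regardless of how much
// time passes between visits.
val naive = df.groupBy("user", "page")
  .agg(count("*").as("hits"), min("timestamp").as("first_seen"))
```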

First things first, let’s assemble a DataFrame with some sample web page activity data:
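
The original sample data isn’t reproduced here, so the snippet below builds a hypothetical DataFrame with made-up users, pages, and timestamps; only the column names user, page, and timestamp matter for the steps that follow.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("PageSessions").getOrCreate()
import spark.implicits._

// Hypothetical sample activity log (made-up values for illustration).
val df = Seq(
  ("101", "home",    Timestamp.valueOf("2019-03-01 10:00:00")),
  ("101", "home",    Timestamp.valueOf("2019-03-01 10:01:30")),
  ("101", "product", Timestamp.valueOf("2019-03-01 10:03:00")),
  ("101", "home",    Timestamp.valueOf("2019-03-01 10:10:00")),
  ("101", "home",    Timestamp.valueOf("2019-03-01 10:11:00")),
  ("102", "product", Timestamp.valueOf("2019-03-01 11:00:00")),
  ("102", "product", Timestamp.valueOf("2019-03-01 11:02:00")),
  ("102", "home",    Timestamp.valueOf("2019-03-01 11:05:00")),
  ("102", "product", Timestamp.valueOf("2019-03-01 11:20:00"))
).toDF("user", "page", "timestamp")

df.show(false)
```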

The solution to be presented here involves a few steps:

  1. Generate a new column first_ts which, for each user, has the value of timestamp in the current row if the page value is different from that in the previous row; otherwise null.
  2. Backfill all the nulls in first_ts with the last non-null value via the Window function last(), storing the result in a new column sess_ts.
  3. Assemble session IDs by concatenating user and the dense_rank of sess_ts within each user partition.

Note that the final output keeps all the intermediate columns (i.e. first_ts and sess_ts) for demonstration purposes.
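
Below is a sketch of the three steps put together. The column names first_ts and sess_ts follow the steps above; the session-ID column name (sess_id) and the exact expressions are my reading of the described approach, not the post's original code.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Window over each user's rows in chronological order.
val byUser = Window.partitionBy("user").orderBy("timestamp")
// Same window, bounded so last() only looks at rows up to the current one.
val fillDown = byUser.rowsBetween(Window.unboundedPreceding, Window.currentRow)
// Window used to rank the distinct session start timestamps per user.
val bySessTs = Window.partitionBy("user").orderBy("sess_ts")

val sessions = df
  // Step 1: keep the timestamp only when the page differs from the previous
  // row's page for the same user (or when there is no previous row).
  .withColumn("first_ts",
    when(lag("page", 1).over(byUser).isNull ||
         lag("page", 1).over(byUser) =!= col("page"), col("timestamp")))
  // Step 2: backfill the nulls with the last non-null first_ts, i.e. the
  // timestamp at which the current session started.
  .withColumn("sess_ts", last("first_ts", ignoreNulls = true).over(fillDown))
  // Step 3: session ID = user + "-" + dense_rank of sess_ts within the user.
  .withColumn("sess_id", concat_ws("-", col("user"), dense_rank().over(bySessTs)))

sessions.orderBy("user", "timestamp").show(false)
```

Ordering the final show() by user and timestamp makes it easy to verify the behavior: a new sess_id starts whenever the page changes, and returning to a previously visited page after viewing a different one starts a new session rather than reusing the old one.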