
Session Groups Problem


You run an ecommerce site called shoesfordogs.com. You want to analyze your visitors, so you compile a DataFrame called hits in which each row records one visitor hitting one page on your site.

import numpy as np
import pandas as pd

pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

generator = np.random.default_rng(90)
products = ['iev','pys','vae','dah','yck','axl','apx','evu','wqv','tfg','aur','rgy','kef','lzj','kiz','oma']
hits = pd.DataFrame({
    'visitor_id':generator.choice(5, size=20, replace=True) + 1,
    'session_id':generator.choice(4, size=20, replace=True),
    'date_time':pd.to_datetime('2020-01-01') + pd.to_timedelta(generator.choice(60, size=20), unit='m'),
    'page_url':[f'shoesfordogs.com/product/{x}' for x in generator.choice(products, size=20, replace=True)]
})
hits['session_id'] = hits.visitor_id * 100 + hits.session_id

print(hits)
#     visitor_id  session_id           date_time                      page_url
# 0            4         400 2020-01-01 00:05:00  shoesfordogs.com/product/pys
# 1            2         200 2020-01-01 00:18:00  shoesfordogs.com/product/oma
# 2            1         102 2020-01-01 00:48:00  shoesfordogs.com/product/evu
# 3            4         403 2020-01-01 00:21:00  shoesfordogs.com/product/oma
# 4            2         201 2020-01-01 00:40:00  shoesfordogs.com/product/yck
# 5            3         302 2020-01-01 00:33:00  shoesfordogs.com/product/pys
# 6            2         203 2020-01-01 00:37:00  shoesfordogs.com/product/rgy
# 7            3         302 2020-01-01 00:54:00  shoesfordogs.com/product/tfg
# 8            3         302 2020-01-01 00:48:00  shoesfordogs.com/product/kef
# 9            4         402 2020-01-01 00:24:00  shoesfordogs.com/product/apx
# 10           3         300 2020-01-01 00:49:00  shoesfordogs.com/product/kef
# 11           1         101 2020-01-01 00:52:00  shoesfordogs.com/product/iev
# 12           3         302 2020-01-01 00:01:00  shoesfordogs.com/product/dah
# 13           4         403 2020-01-01 00:02:00  shoesfordogs.com/product/lzj
# 14           4         401 2020-01-01 00:42:00  shoesfordogs.com/product/evu
# 15           5         500 2020-01-01 00:39:00  shoesfordogs.com/product/apx
# 16           5         503 2020-01-01 00:31:00  shoesfordogs.com/product/dah
# 17           3         303 2020-01-01 00:01:00  shoesfordogs.com/product/lzj
# 18           2         200 2020-01-01 00:16:00  shoesfordogs.com/product/aur
# 19           1         100 2020-01-01 00:11:00  shoesfordogs.com/product/apx

You suspect that the undocumented third-party tracking system on your website is buggy and sometimes splits one session into two or more session_ids. You want to correct this behavior by creating a field called session_group_id that stitches broken session_ids together.

Two sessions, A and B, should belong to the same session group if

  1. They have the same visitor_id, and
  2. At least one of the following holds:
    1. Their hits overlap in time.
    2. The latest hit from A is within five minutes of the earliest hit from B, or vice versa.

The grouping is transitive. So, if A is grouped with B, and B is grouped with C, then A should be grouped with C as well.

Create a column in hits called session_group_id that identifies which hits belong to the same session group.
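
One possible approach, offered as a sketch rather than the intended solution: collapse each session to the interval between its first and last hit, sort each visitor's sessions by start time, and open a new group whenever a session starts more than five minutes after every earlier session of that visitor has ended. The function name add_session_group_id and the max_gap parameter are hypothetical. The sketch also assumes that "overlap in time" means the two sessions' first-to-last-hit intervals intersect, and that "within five minutes" includes a gap of exactly five minutes.

```python
import pandas as pd

def add_session_group_id(hits, max_gap=pd.Timedelta(minutes=5)):
    """Return a copy of hits with a session_group_id column (a sketch)."""
    # Collapse each session to the interval [first hit, last hit].
    sessions = (
        hits.groupby(['visitor_id', 'session_id'])['date_time']
        .agg(start='min', end='max')
        .reset_index()
        .sort_values(['visitor_id', 'start', 'end'], ignore_index=True)
    )
    # Latest end time among the same visitor's earlier-starting sessions
    # (NaT for each visitor's first session).
    prev_end = sessions.groupby('visitor_id')['end'].transform(
        lambda s: s.cummax().shift()
    )
    # A session opens a new group when no earlier session of the same
    # visitor ends within max_gap of its start (assumed inclusive).
    starts_new_group = prev_end.isna() | (sessions['start'] - prev_end > max_gap)
    sessions['session_group_id'] = starts_new_group.cumsum()
    # Map the group ids back onto the individual hits, keeping row order.
    group_ids = sessions.set_index(['visitor_id', 'session_id'])['session_group_id']
    return hits.join(group_ids, on=['visitor_id', 'session_id'])

# Hypothetical toy data: one hit per session, minutes after midnight.
t0 = pd.Timestamp('2020-01-01')
demo = pd.DataFrame({
    'visitor_id': [1, 1, 1, 1, 2, 2],
    'session_id': [100, 101, 102, 103, 200, 201],
    'date_time': t0 + pd.to_timedelta([0, 4, 8, 20, 0, 3], unit='m'),
})
print(add_session_group_id(demo))
```

In the toy data, sessions 100 and 102 are eight minutes apart, but both lie within five minutes of session 101, so transitivity places all three in one group. The group ids are arbitrary labels; only their equality between rows is meaningful. With the problem's data, hits = add_session_group_id(hits) would add the column.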

