Markov model with additional features
A classical Markov attribution model uses only the sequence of channels and ignores any additional information available about each touchpoint. In this article, we present an algorithm that performs transaction-level attribution using an ensemble of Markov models, estimating one model per available feature.
The algorithm
Data
Suppose you have the following data:
id_path | channel_pos | channel | region | segment | seconds_to_last_touch | position |
---|---|---|---|---|---|---|
0 | 1 | C | v05 | w08 | 1532 | first |
0 | 2 | C | v02 | w11 | 919 | middle |
0 | 3 | C | v05 | w03 | 317 | middle |
0 | 4 | C | v03 | w09 | 0 | last |
0 | 5 | ((CONV)) | NA | NA | NA | middle |
1 | 1 | C | v01 | w13 | 582 | first |
1 | 2 | C | v06 | w11 | 0 | last |
4 | 1 | C | v01 | w04 | 0 | first_last |
4 | 2 | ((CONV)) | NA | NA | NA | middle |
... | ... | ... | ... | ... | ... | ... |
id_path: Identifier of the customer journey.
channel_pos: Indicates the position of the channel within the customer journey.
channel: The marketing channel, where "((CONV))" denotes a conversion event.
region, segment, position: Categorical variables providing additional journey attributes.
seconds_to_last_touch: A numeric variable indicating the number of seconds from a given channel to the last touchpoint.
Discretization
The first step of the algorithm is to discretize the numeric variables.
In our case, seconds_to_last_touch is the only numeric variable, so each of its values is replaced by a bin label:
id_path | channel_pos | channel | region | segment | seconds_to_last_touch | position |
---|---|---|---|---|---|---|
0 | 1 | C | v05 | w08 | bin_10 | first |
0 | 2 | C | v02 | w11 | bin_8 | middle |
0 | 3 | C | v05 | w03 | bin_5 | middle |
0 | 4 | C | v03 | w09 | bin_1 | last |
0 | 5 | ((CONV)) | NA | NA | NA | middle |
1 | 1 | C | v01 | w13 | bin_5 | first |
1 | 2 | C | v06 | w11 | bin_1 | last |
4 | 1 | C | v01 | w04 | bin_1 | first_last |
4 | 2 | ((CONV)) | NA | NA | NA | middle |
... | ... | ... | ... | ... | ... | ... |
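The article does not state which binning rule produces the labels above. As a minimal sketch, assuming equal-frequency (quantile) bins computed on the non-missing values, the step could look like this. The function names `quantile_edges` and `discretize` and the choice of 4 bins are illustrative assumptions, so the resulting labels will not match the table.

```python
from bisect import bisect_left


def quantile_edges(values, n_bins):
    """Upper edges of equal-frequency bins (assumed binning rule)."""
    ordered = sorted(values)
    edges = []
    for k in range(1, n_bins):
        # Position of the k-th quantile in the sorted sample.
        idx = min(len(ordered) - 1, (k * len(ordered)) // n_bins)
        edges.append(ordered[idx])
    # Remove duplicated edges so that ties do not create empty bins.
    return sorted(set(edges))


def discretize(value, edges):
    """Map a number to a label such as 'bin_1'; missing values stay 'NA'."""
    if value is None:
        return "NA"
    # Bin k covers the interval (edges[k-2], edges[k-1]].
    return f"bin_{bisect_left(edges, value) + 1}"


# seconds_to_last_touch from the example rows; None marks the NA rows.
seconds = [1532, 919, 317, 0, None, 582, 0, 0, None]
edges = quantile_edges([s for s in seconds if s is not None], n_bins=4)
labels = [discretize(s, edges) for s in seconds]
```

Smaller values always map to lower bin numbers, and conversion rows keep their NA marker.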
Artificial channel
For each feature, we create an artificial channel by combining channel with that feature. For example, using region, we get:
id_path | channel_pos | channel_region |
---|---|---|
0 | 1 | C_v05 |
0 | 2 | C_v02 |
0 | 3 | C_v05 |
0 | 4 | C_v03 |
0 | 5 | ((CONV)) |
1 | 1 | C_v01 |
1 | 2 | C_v06 |
4 | 1 | C_v01 |
4 | 2 | ((CONV)) |
... | ... | ... |
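Building the artificial channel is a string concatenation that leaves conversion events untouched. A minimal sketch, where the helper name `artificial_channel` is an assumption made for illustration:

```python
CONVERSION = "((CONV))"


def artificial_channel(channel, feature_value):
    """Combine a channel with a feature value; conversion events are kept as-is."""
    if channel == CONVERSION:
        return channel
    return f"{channel}_{feature_value}"


rows = [
    # (id_path, channel_pos, channel, region)
    (0, 1, "C", "v05"),
    (0, 2, "C", "v02"),
    (0, 3, "C", "v05"),
    (0, 4, "C", "v03"),
    (0, 5, "((CONV))", "NA"),
    (1, 1, "C", "v01"),
    (1, 2, "C", "v06"),
    (4, 1, "C", "v01"),
    (4, 2, "((CONV))", "NA"),
]

channel_region = [
    (id_path, pos, artificial_channel(channel, region))
    for id_path, pos, channel, region in rows
]
```

The same helper can be reused with segment, position, or the discretized seconds_to_last_touch in place of region.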
Transposing the data so that each customer journey becomes a single path, we get:
path | total_conversions | total_nulls |
---|---|---|
C_v05 > C_v02 > C_v05 > C_v03 | 1 | 0 |
C_v01 > C_v06 | 0 | 1 |
C_v01 | 1 | 0 |
... | ... | ... |
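The transposition can be sketched as a group-by on id_path followed by a count per distinct path. The function name `to_paths` is hypothetical; the sketch assumes a journey counts as converted when it contains a ((CONV)) row and as a null otherwise.

```python
from collections import defaultdict

CONVERSION = "((CONV))"


def to_paths(rows):
    """Aggregate (id_path, channel_pos, channel) rows into one record per distinct path."""
    journeys = defaultdict(list)
    for id_path, channel_pos, channel in rows:
        journeys[id_path].append((channel_pos, channel))

    totals = defaultdict(lambda: [0, 0])  # path -> [total_conversions, total_nulls]
    for touches in journeys.values():
        # Order the touchpoints by their position within the journey.
        ordered = [channel for _, channel in sorted(touches)]
        path = " > ".join(ch for ch in ordered if ch != CONVERSION)
        converted = CONVERSION in ordered
        totals[path][0 if converted else 1] += 1
    return {path: tuple(counts) for path, counts in totals.items()}


rows = [
    (0, 1, "C_v05"), (0, 2, "C_v02"), (0, 3, "C_v05"), (0, 4, "C_v03"),
    (0, 5, "((CONV))"),
    (1, 1, "C_v01"), (1, 2, "C_v06"),
    (4, 1, "C_v01"), (4, 2, "((CONV))"),
]
paths = to_paths(rows)
```

Journeys that share the same sequence of artificial channels are summed into the same row.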
Using this data, we fit a classical Markov model and calculate the odds for each channel. For example, the odds for an artificial channel $c$ are:
$$odds(c) = \frac{p(c)}{1 - p(c)}$$
where $p(c)$ is the conversion probability of $c$, estimated from the transition matrix of the Markov model. Odds capture the effectiveness of each channel in driving conversions.
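To make the odds computation concrete, here is a rough sketch under several assumptions that the article does not confirm: the model is first-order, the conversion probability of a channel is the probability of eventually reaching the conversion state from that channel, and the odds are the usual ratio p / (1 - p). The state names `(start)` and `(null)`, the function names, and the toy paths with channels A and B are all made up for illustration.

```python
from collections import defaultdict

START, CONV, NULL = "(start)", "((CONV))", "(null)"


def transition_probabilities(paths):
    """paths: list of (channels, total_conversions, total_nulls)."""
    counts = defaultdict(lambda: defaultdict(float))
    for channels, n_conv, n_null in paths:
        states = [START] + channels
        for a, b in zip(states, states[1:]):
            counts[a][b] += n_conv + n_null
        # The last channel leads either to a conversion or to a null.
        counts[states[-1]][CONV] += n_conv
        counts[states[-1]][NULL] += n_null
    return {
        a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
        for a, nxt in counts.items()
    }


def conversion_probabilities(probs, n_sweeps=200):
    """Probability of eventually reaching CONV from each state.

    Approximated by repeated sweeps; an exact solution would solve
    the corresponding linear system instead.
    """
    p = {state: 0.0 for state in probs}
    p[CONV], p[NULL] = 1.0, 0.0
    for _ in range(n_sweeps):
        for state, nxt in probs.items():
            p[state] = sum(w * p[target] for target, w in nxt.items())
    return p


def odds(p):
    """Odds of a probability; infinite when p equals 1."""
    return float("inf") if p >= 1.0 else p / (1.0 - p)


# Toy, made-up paths: (channels, total_conversions, total_nulls).
toy_paths = [(["A", "B"], 1, 1), (["A"], 0, 2), (["B"], 1, 0)]
p = conversion_probabilities(transition_probabilities(toy_paths))
odds_by_channel = {ch: odds(p[ch]) for ch in ("A", "B")}
```

In this toy example B converts two times out of three, so its odds are about 2, while A reaches a conversion only through B and ends up with odds of about 0.5.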
These odds are then used as path-level weights to compute the transaction-level attribution:
id_path | channel_pos | attribution_channel_region |
---|---|---|
0 | 1 | 0.15 |
0 | 2 | 0.40 |
0 | 3 | 0.15 |
0 | 4 | 0.30 |
0 | 5 | ((CONV)) |
1 | 1 | 0.20 |
1 | 2 | 0.40 |
4 | 1 | 0.20 |
4 | 2 | ((CONV)) |
... | ... | ... |
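The mapping from odds to touchpoints can be sketched as a lookup per artificial channel. The odds values below are read off the illustrative table above, on the assumption that it shows one weight per channel; the optional normalization within a path is an assumption of this sketch and is not described in the article. The function name `touchpoint_weights` is hypothetical.

```python
def touchpoint_weights(channels, odds_by_channel, normalize=False):
    """Attach to each touchpoint of a path the odds of its artificial channel."""
    weights = [odds_by_channel[channel] for channel in channels]
    if normalize:
        # Assumed variant: rescale so the weights of a path sum to 1.
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights


# Illustrative odds, taken from the example table (not real estimates).
odds_by_channel = {
    "C_v05": 0.15, "C_v02": 0.40, "C_v03": 0.30, "C_v01": 0.20, "C_v06": 0.40,
}

path_0 = touchpoint_weights(["C_v05", "C_v02", "C_v05", "C_v03"], odds_by_channel)
path_1_shares = touchpoint_weights(["C_v01", "C_v06"], odds_by_channel, normalize=True)
```

With `normalize=False` the output reproduces the rows of the table for path 0; with `normalize=True` the weights become shares of the path.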
If you want to know more about odds, please refer to this link.
Next, we evaluate the predictive performance of the model by calculating the area under the precision-recall curve (AUC-PR). We prefer the precision-recall curve to the classical ROC curve because datasets in this type of problem are usually imbalanced, with many more non-converting paths than converting ones.
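One common estimator of the area under the precision-recall curve is average precision. The sketch below assumes that each path has a score (for instance a predicted conversion probability; the article does not specify how the score is obtained) and a binary label that is 1 for converting paths. The function name and the example numbers are illustrative, and tied scores are not given any special treatment.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one converting path is required")
    # Rank paths from the highest to the lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_pos += 1
            # Precision at this rank times the recall gained (1 / n_pos).
            area += (true_pos / rank) / n_pos
    return area


# Made-up labels (1 = converting path) and scores.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
auc_pr = average_precision(labels, scores)
```

For a model that ranks paths at random, this quantity is close to the share of converting paths, which is why the conversion rate is a natural baseline for AUC-PR on imbalanced data.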
Final attribution
We estimate a different Markov model for each combination of (channel × feature) and collect transaction-level attributions for each, along with the model's predictive performance. The final transaction-level attribution is the weighted mean of all the collected attributions, where the weights are a transformation of the predictive performances.
The transformation we use is:
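$$weight(X) = \begin{cases} \dfrac{AUC\_PRE\_REC(X) - cr}{1 - cr} & \text{if } AUC\_PRE\_REC(X) \geq cr \\ 0 & \text{otherwise} \end{cases}$$
where $X$ represents a generic feature, $AUC\_PRE\_REC(X)$ is the area under the precision-recall curve for feature $X$, and $cr$ is the dataset's conversion rate.
$weight(X)$ is a measure between 0 and 1 because $AUC\_PRE\_REC(X)$ ranges from 0 to 1.
When $AUC\_PRE\_REC(X) = cr$, then $weight(X) = 0$, meaning that feature $X$ has no significant influence on the target variable.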
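The final step can be sketched as follows, assuming the transformation rescales AUC-PR so that a value equal to the conversion rate maps to 0 and a value of 1 maps to 1, with anything below the conversion rate set to 0. The function names, feature names, and numbers are illustrative assumptions, not results from the article.

```python
def feature_weight(auc_pr, conversion_rate):
    """Rescale AUC-PR to [0, 1], with the conversion rate as the zero point."""
    if auc_pr < conversion_rate:
        return 0.0
    return (auc_pr - conversion_rate) / (1.0 - conversion_rate)


def final_attribution(attributions, weights):
    """Weighted mean, touchpoint by touchpoint, of the per-feature attributions.

    attributions: {feature: [value per touchpoint]}
    weights:      {feature: weight}
    """
    total = sum(weights[feature] for feature in attributions)
    if total == 0:
        raise ValueError("all feature weights are zero")
    n_touchpoints = len(next(iter(attributions.values())))
    return [
        sum(weights[f] * attributions[f][i] for f in attributions) / total
        for i in range(n_touchpoints)
    ]


# Made-up AUC-PR values for two features and a 10% conversion rate.
weights = {
    "region": feature_weight(0.55, 0.10),
    "segment": feature_weight(0.05, 0.10),
}
combined = final_attribution(
    {"region": [0.2, 0.8], "segment": [0.6, 0.4]}, weights
)
```

Here segment performs below the conversion-rate baseline, so its weight is 0 and the final attribution coincides with the one obtained from region.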