Welcome to our comprehensive guide on preparing your data for effective channel attribution analysis. This post will walk you through the essential steps of data cleaning and wrangling, ensuring your data aligns perfectly with the requirements of our channel attribution library.
To effectively utilize our library, your data must be formatted in a specific manner (see Fig 1). This post will guide you through the necessary Python script for data cleaning and wrangling to prepare your data for the channel attribution library.

Typically, event data will include a user ID or cookies, event time, channel, and conversion status (see Fig 2). Conversion time is optional and used for businesses with longer purchase cycles to set appropriate cut-off limits. Note that here the data take purchased activity as a channel.

Fig 2. Example of common input data format by event sequence.
Now that we have established the data requirements, let's proceed with the step-by-step transformation process, starting with preparing the DataFrame.
Begin by converting data types and sorting by user ID and time to accurately capture user touchpoints.
import pandas as pd
# Reading the CSV file with all columns as strings
df = pd.read_csv('raw.csv', dtype="str")
# Convert 'revenue' column to numeric, converting invalid parsing to NaN
df['revenue'] = pd.to_numeric(df['revenue'])
# Convert 'conversion' column to float
df['conversion'] = df['conversion'].astype('float')
# Convert 'time' column to datetime with microsecond precision
df['time'] = pd.to_datetime(df['time'], unit="us")
# Sort the DataFrame by 'user_id' and 'time' columns and reset the index
df = df.sort_values(by=['user_id', 'time']).reset_index(drop=True)
Before we dive deeper into journey identification and data transformation, it's important to examine the distribution of channels within our dataset. This analysis helps us understand whether our data is balanced across different channels, which is crucial for unbiased analysis and modeling.
channel_counts = df['channel'].value_counts()
# Plotting the channel counts
plt.figure(figsize=(10, 6))
channel_counts.plot(kind='bar', color='skyblue')
plt.title('Channel Frequency')
plt.xlabel('Channel')
plt.ylabel('Frequency')
plt.xticks(rotation=45)  # Rotating the x-axis labels for better readability
plt.show()
Here is the bar chart representing the frequency of each channel:

- Dominance of Specific Channels: The channels 'Other' and 'Organic Search' are the most represented in the dataset. This suggests that a significant portion of our traffic and user engagement stems from these sources.
- Moderate Engagement Channels: Channels like 'Affiliates', 'Email', and 'Referral' show moderate engagement levels. These channels are crucial as they often indicate more direct and targeted user interactions.
- Underutilized Channels: The lower frequency of channels such as 'Purchased' and 'Organic Shopping' points to potential growth areas. These channels may be underutilized or underdeveloped in our current strategy.
- Optimization Opportunities: The disparity in channel usage indicates that there may be substantial opportunities to optimize underperforming channels like 'Paid Search' and 'Organic Social'.
- Balanced Marketing Approach: Considering the dominance of 'Other' and 'Organic Search', there could be a risk of over-reliance on these channels. Diversifying our approach could help mitigate risks associated with changes in search engine algorithms or platform policies.
Once we have a clear view of how channels are distributed, the next step is to delineate the path of each conversion into what we term a "journey." Journey are categorized into three types:
- From the user's first visit to the purchase.
- From a post-purchase visit to the next purchase.
- From the first to the last visit without any purchase.
To track new journeys, use the following code:
# Create a new column 'is_new_journey' where a new journey starts after each conversion
df['is_new_journey'] = ((df['conversion'] == 1).shift(1, fill_value=0)).astype(int)
# Identify changes in user_pseudo_id to mark the start of a new journey
cookie_change = df['user_id'] != df['user_id'].shift(1)
df.loc[cookie_change, 'is_new_journey'] = 1 
# Cumulatively sum 'is_new_journey' to assign a unique journey ID to each journey
df['journey_id'] = df['is_new_journey'].cumsum()

Fig 3. Illustration of adding a global journey column.
For longer purchase cycles, limit the analysis to recent interactions:
# Find the lastest event time for each journey and rename the series to 'lastest_event_time', 
lastest_event_time = df.groupby('journey_id')['time'].max().rename('lastest_event_time')
# Join the 'lastest_event_time' series back to the original DataFrame on the 'journey_id' column
df = df.join(lastest_event_time, on='journey_id')
# Filter the DataFrame to keep rows where the 'time' is within 30 days of 'lastest_event_time' 
df = df[(df['time'] >= df['lastest_event_time'] - pd.Timedelta(days=30))]
To gain insights into user behavior and optimize marketing strategies, we aggregate the data by each journey. This process involves compiling the sequence of channel interactions into a single string for each journey, providing a clear visual representation of user journeys. We use the ' > ' symbol as a separator to concatenate the channels interacted with.
# Filter out rows where the 'channel' column contains 'nan' or 'purchased'
df_final = df[df['channel']!= 'purchased']
# Group the data by 'journey_id' and concatenate the channel names into a single string for each journey
df_paths = df_final .groupby('journey_id').agg(
    path=('channel', ' > '.join)
).reset_index()

Fig 4. Example of concatenated channel paths.
We apply functions to clean up paths by removing consecutive duplicates from paths. This ensures that our analysis focuses purely on the influence of pre-conversion touchpoints:
def remove_consecutive_duplicates(path):
    # Split the input path string into individual components based on the separator ' > ' 
    components = path.split(' > ')
    
    # Create a list of components, omitting consecutive duplicates
    cleaned_components = [components[i] for i in range(len(components) - 1) 
                          if components[i] != components[i + 1]] + [components[-1]]
    
    # Rejoin the cleaned components into a single string with the original separator
    return ' > '.join(cleaned_components)
# Apply the function to each path in the 'path' column to remove consecutive duplicates
df_paths['path'] = df_paths['path'].apply(remove_consecutive_duplicates)
Merge the conversion data to include only the final interaction from each journey, ensuring that analysis focuses on the outcomes of the last user actions within each journey. This method captures the most crucial touchpoints that potentially lead to conversions, providing a clearer picture of which interactions are most effective at driving final user decisions.
# Drop duplicates within the Data to keep only the last interaction for each journey
df_last_interaction = df.drop_duplicates('journey_id', keep='last')[['journey_id', 'conversion', 'revenue']]
# Merge the path data with the last interaction data based on the journey identifier
df_paths = df_paths.merge(df_last_interaction, on='journey_id', how='left')
Finally, we aggregate and calculate key metrics for each user path. This includes the total number of conversions, the total revenue associated with these conversions
# Create a new column 'null' in df_paths where each entry is 1 if 'conversion' is 0, otherwise 0
df_paths['null'] = df_paths['conversion'].apply(lambda x: 1 if x == 0 else 0)
# Group the data in df_paths by 'path' and aggregate the total conversions, total non-conversions, 
# and total revenue for each path
attribution_df = df_paths.groupby('path').agg(
    total_conversions=('conversion', 'sum'),  # Sum of conversions for each path
    total_null=('null', 'sum'),               # Sum of non-conversions (null conversions) for each path
    total_conversion_value=('revenue', 'sum') # Sum of revenue for each path
).reset_index()

Fig 6. Final dataframe formatted for the ChannelAttribution library.
This post detailed the steps involved in cleaning and transforming data to prepare it for use. In the next post , we will show you how to use the library and how to explain the result.