Skip to main content

In this tutorial, we will demonstrate how to build the data table from the exported GA4 analytics data in BigQuery. These table can be then used for producing a suitable dataset to interact with ChannelAttribution Pro libraries.

Compared with GA4, BigQuery offers more flexibility in the use of raw analytics data, providing a broader range of options to manage this information and gain powerful insights to fuel your marketing efforts.

To begin with, we will take a brief tour of the GA4 data exported to BigQuery and its nuances.


Event Preview


In order to generate df_session and df_conversion csv files, we only need to pay attention to the following event_* fields:

Event fields:

  • event_timestamp: the time (in microseconds) when the event was collected (in UTC time zone)
  • event_name: the name of the event (e.g. first_visit, session_start, purchase, etc.)
  • event_params: keys and values of the event parameters collected with the event. It is a nested structure composed by a key and an array of four possible values. The code displayed below uses many JSON_* and UNNEST SQL commands to getting those values

User fields:

  • user_pseudo_id: the pseudonymous ID (or client ID) of the user who triggered the event

E-commerce fields:

  • purchase_revenue_in_usd: purchase revenue collected with the event (only for purchase events)

Collected_traffic_source fields:

  • manual_campaign_name: the campaign name collected with the event (i.e. utm_campaign)
  • manual_source: the source collected with the event (i.e. utm_source)
  • manual_medium: the medium collected with the event (i.e. utm_medium)
There are two fields in GA4 raw data, one is traffic_source and the other is collected_traffic_source. Do not use the traffic_source fields, the traffic info in this field is actually the first visit traffic source, not the session level traffic.

The SQL script is as follows:

WITH prep_session_start AS (
SELECT
event_timestamp,
CASE
WHEN SAFE_CAST(event_timestamp AS INT64) IS NOT NULL THEN
FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%E*S', TIMESTAMP_MICROS(CAST(event_timestamp AS INT64)))
ELSE NULL
END AS event_time,
CONCAT(
user_pseudo_id,
(
SELECT
value.int_value
FROM
UNNEST (event_params)
WHERE
key = 'ga_session_id'
)
) AS session_id,

user_pseudo_id,

ARRAY_AGG(
collected_traffic_source.manual_campaign_name IGNORE NULLS
ORDER BY
event_timestamp ASC
LIMIT
1
) [SAFE_OFFSET(0)] AS session_campaign_name,

ARRAY_AGG(
collected_traffic_source.manual_source IGNORE NULLS
ORDER BY
event_timestamp ASC
LIMIT
1
) [SAFE_OFFSET(0)] AS session_source,

ARRAY_AGG(
collected_traffic_source.manual_medium IGNORE NULLS
ORDER BY
event_timestamp ASC
LIMIT
1
) [SAFE_OFFSET(0)] AS session_medium

FROM
`raw_session_dataset.events_*`

WHERE
_TABLE_SUFFIX BETWEEN '20240301' AND '20240303'
AND event_name = "session_start"
AND user_pseudo_id IS NOT NULL

GROUP BY
event_timestamp,
user_pseudo_id,
session_id
),

add_channel AS (
SELECT
*,
CASE
WHEN session_source = '(direct)' and (session_medium = '(not set)' or session_medium = '(none)')
THEN 'Direct'
WHEN session_medium = 'organic'
THEN 'Organic Search'
WHEN regexp_contains(session_source, '^yandex|yahoo|msn|adpiler$')
THEN 'Organic Search'
WHEN regexp_contains(session_medium, '^social|social-network|social-media|sm|social network|social media$')
THEN 'Organic Social'
WHEN regexp_contains(session_source, '^facebook|instagram|linkedin|pinterest|flipboard|youtube $')
THEN 'Social'
WHEN session_medium = 'paid_social'
THEN 'Paid Social'
WHEN session_medium = 'email'
THEN 'Email'
WHEN session_medium = 'affiliate'
THEN 'Affiliates'
WHEN session_medium = 'referral'
THEN 'Referral'
WHEN regexp_contains(session_medium, '^cpc|ppc|paidsearch$')
THEN 'Paid Search'
WHEN regexp_contains(session_medium, '^cpv|cpa|cpp|content-text$')
THEN 'Other Advertising'
WHEN regexp_contains(session_medium, '^display|cpm|banner$')
THEN 'Display'
WHEN regexp_contains(session_campaign_name, '^PLA$')
THEN 'Paid Search'
WHEN regexp_contains(session_campaign_name, '^PMAX$')
THEN 'Cross Network'
WHEN regexp_contains(session_source, '^IGShopping|amazon-adsystem|teamshopping|adpiler|sfmc$') THEN 'Organic Shopping'
ELSE '(Other)'
END AS channel

FROM prep_session_start
),

session_final AS (
SELECT
event_timestamp AS event_time,
channel,
user_pseudo_id AS user_id,
NULL AS revenue,
0 AS conversion
FROM
add_channel
),

prep_purchase AS (
SELECT
CASE
WHEN SAFE_CAST(event_timestamp AS INT64) IS NOT NULL THEN
FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%E*S', TIMESTAMP_MICROS(CAST(event_timestamp AS INT64)))
ELSE NULL
END AS event_time,
event_name AS channel,
user_pseudo_id AS user_id,
ecommerce.purchase_revenue_in_usd AS revenue,
1 AS conversion
FROM
`raw_session_dataset.events_*`
WHERE
_TABLE_SUFFIX BETWEEN '20240301' AND '20240303'
AND event_name = "purchase"
AND user_pseudo_id IS NOT NULL
),

purchase_final AS (
SELECT
event_time,
channel,
user_id,
revenue,
conversion
FROM
prep_purchase
),

final AS (

SELECT *
FROM session_final

UNION ALL

SELECT *
FROM purchase_final)

SELECT *
FROM final
ORDER BY user_id, event_time;

Here are some comments to clarify the logic behind the script:

  • Getting a true unique session id: To create the user journey, we need to find each visit of users. By concatenating user_pseudo_id and the GA4 property data ga_session_id found in the event_params nested structure, we can create a unique session id, which allows us to retrieve only the first event each time users visit our page (per session).

  • We convert event_timestamp, expressed as UTC integer, to a UTC string and assign it to the event_time field.

  • Collected_traffic_source structure: This is a collection of the same event-level traffic source details (manual_source, manual_medium, and manual_campaign_name dimensions, among others) that you can find nested inside the event_params JSON array.

  • Function details:

    • IGNORE NULLS Clause: When used with ARRAY_AGG, any null values in the collected_traffic_source column are excluded from the array, avoiding the need for post-processing to delete them.

    • SAFE_OFFSET(0): This is used to safely access the first element of the array. If the array is empty (which can happen if all values are null and thus ignored), SAFE_OFFSET returns null instead of an error.

    • GA4 update: Since the last update in Nov 2023, session_start and first_visit are events sharing the same information. Hence, we are using session_start in our SQL scripts.

    • _TABLE_SUFFIX: This is used for choosing from sharded tables whose name suffixes fall (e.g., events_20240301) within a given date range.

    • Be aware that sharded tables are generated automatically once BigQuery finds tables that share the following characteristics:

      • Exist in the same dataset
      • Have the exact same table schema
      • The same prefix
      • Have a suffix of the form _YYYYMMDD (eg. 20240501)
  • Channel Grouping: The channel grouping here is an example. One can define their own channel grouping based on their business strategy. It is implemented in the section add_channel.

  • event_name: We are only interested in records where event_name = 'session_start' or event_name = 'purchase'.

  • At the prep_purchase CTE, we rename the event as channel, this is on purpose so we can use this result directly in our python script for data transformation.

  • ecommerce.purchase_revenue_in_usd: Purchase revenue dimension collected with the event (only for purchase events).

Finally, the generated table looks like this:

Output Preview

The post-processing of this table will produce a final dataset suitable to be used in the ChannelAttribution Pro library interface.