Skip to main content
Version: 3.12

In this tutorial, we will demonstrate how to build the data table from the exported GA4 analytics data in BigQuery. These table can be then used for producing a suitable dataset to interact with ChannelAttribution Pro libraries.

Compared with GA4, BigQuery offers more flexibility in the use of raw analytics data, providing a broader range of options to manage this information and gain powerful insights to fuel your marketing efforts.

To begin with, we will take a brief tour of the GA4 data exported to BigQuery and its nuances.


Event Preview


In order to generate df_session and df_conversion csv files, we only need to pay attention to the following event_* fields:

Event fields:

  • event_timestamp: the time (in microseconds) when the event was collected (in UTC time zone)
  • event_name: the name of the event (e.g. first_visit, session_start, purchase, etc.)
  • event_params: keys and values of the event parameters collected with the event. It is a nested structure composed by a key and an array of four possible values. The code displayed below uses many JSON_* and UNNEST SQL commands to getting those values

User fields:

  • user_pseudo_id: the pseudonymous ID (or client ID) of the user who triggered the event

E-commerce fields:

  • purchase_revenue_in_usd: purchase revenue collected with the event (only for purchase events)

Collected_traffic_source fields:

  • manual_campaign_name: the campaign name collected with the event (i.e. utm_campaign)
  • manual_source: the source collected with the event (i.e. utm_source)
  • manual_medium: the medium collected with the event (i.e. utm_medium)
There are two fields in GA4 raw data, one is traffic_source and the other is collected_traffic_source. Do not use the traffic_source fields, the traffic info in this field is actually the first visit traffic source, not the session level traffic.

The SQL script is as follows:

WITH prep_session_start AS (
SELECT
event_timestamp,
CASE
WHEN SAFE_CAST(event_timestamp AS INT64) IS NOT NULL THEN
FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%E*S', TIMESTAMP_MICROS(CAST(event_timestamp AS INT64)))
ELSE NULL
END AS event_time,
CONCAT(
user_pseudo_id,
(
SELECT
value.int_value
FROM
UNNEST (event_params)
WHERE
key = 'ga_session_id'
)
) AS session_id,

user_pseudo_id,

ARRAY_AGG(
collected_traffic_source.manual_campaign_name IGNORE NULLS
ORDER BY
event_timestamp ASC
LIMIT
1
) [SAFE_OFFSET(0)] AS session_campaign_name,

ARRAY_AGG(
collected_traffic_source.manual_source IGNORE NULLS
ORDER BY
event_timestamp ASC
LIMIT
1
) [SAFE_OFFSET(0)] AS session_source,

ARRAY_AGG(
collected_traffic_source.manual_medium IGNORE NULLS
ORDER BY
event_timestamp ASC
LIMIT
1
) [SAFE_OFFSET(0)] AS session_medium

FROM
`raw_session_dataset.events_*`

WHERE
_TABLE_SUFFIX BETWEEN '20240301' AND '20240303'
AND event_name = "session_start"
AND user_pseudo_id IS NOT NULL

GROUP BY
event_timestamp,
user_pseudo_id,
session_id
),

add_channel AS (
SELECT
*,
CASE
WHEN session_source = '(direct)' and (session_medium = '(not set)' or session_medium = '(none)')
THEN 'Direct'
WHEN session_medium = 'organic'
THEN 'Organic Search'
WHEN regexp_contains(session_source, '^yandex|yahoo|msn|adpiler$')
THEN 'Organic Search'
WHEN regexp_contains(session_medium, '^social|social-network|social-media|sm|social network|social media$')
THEN 'Organic Social'
WHEN regexp_contains(session_source, '^facebook|instagram|linkedin|pinterest|flipboard|youtube $')
THEN 'Social'
WHEN session_medium = 'paid_social'
THEN 'Paid Social'
WHEN session_medium = 'email'
THEN 'Email'
WHEN session_medium = 'affiliate'
THEN 'Affiliates'
WHEN session_medium = 'referral'
THEN 'Referral'
WHEN regexp_contains(session_medium, '^cpc|ppc|paidsearch$')
THEN 'Paid Search'
WHEN regexp_contains(session_medium, '^cpv|cpa|cpp|content-text$')
THEN 'Other Advertising'
WHEN regexp_contains(session_medium, '^display|cpm|banner$')
THEN 'Display'
WHEN regexp_contains(session_campaign_name, '^PLA$')
THEN 'Paid Search'
WHEN regexp_contains(session_campaign_name, '^PMAX$')
THEN 'Cross Network'
WHEN regexp_contains(session_source, '^IGShopping|amazon-adsystem|teamshopping|adpiler|sfmc$') THEN 'Organic Shopping'
ELSE '(Other)'
END AS channel

FROM prep_session_start
),

session_final AS (
SELECT
event_timestamp AS event_time,
channel,
user_pseudo_id AS user_id,
NULL AS revenue,
0 AS conversion
FROM
add_channel
),

prep_purchase AS (
SELECT
CASE
WHEN SAFE_CAST(event_timestamp AS INT64) IS NOT NULL THEN
FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%E*S', TIMESTAMP_MICROS(CAST(event_timestamp AS INT64)))
ELSE NULL
END AS event_time,
event_name AS channel,
user_pseudo_id AS user_id,
ecommerce.purchase_revenue_in_usd AS revenue,
1 AS conversion
FROM
`raw_session_dataset.events_*`
WHERE
_TABLE_SUFFIX BETWEEN '20240301' AND '20240303'
AND event_name = "purchase"
AND user_pseudo_id IS NOT NULL
),

purchase_final AS (
SELECT
event_time,
channel,
user_id,
revenue,
conversion
FROM
prep_purchase
),

final AS (

SELECT *
FROM session_final

UNION ALL

SELECT *
FROM purchase_final)

SELECT *
FROM final
ORDER BY user_id, event_time;

Here are some comments to clarify the logic behind the script:

  • Getting a true unique session id: To create the user journey, we need to find each visit of users. By concatenating user_pseudo_id and the GA4 property data ga_session_id found in the event_params nested structure, we can create a unique session id, which allows us to retrieve only the first event each time users visit our page (per session).

  • We convert event_timestamp, expressed as UTC integer, to a UTC string and assign it to the event_time field.

  • Collected_traffic_source structure: This is a collection of the same event-level traffic source details (manual_source, manual_medium, and manual_campaign_name dimensions, among others) that you can find nested inside the event_params JSON array.

  • Function details:

    • IGNORE NULLS Clause: When used with ARRAY_AGG, any null values in the collected_traffic_source column are excluded from the array, avoiding the need for post-processing to delete them.

    • SAFE_OFFSET(0): This is used to safely access the first element of the array. If the array is empty (which can happen if all values are null and thus ignored), SAFE_OFFSET returns null instead of an error.

    • GA4 update: Since the last update in Nov 2023, session_start and first_visit are events sharing the same information. Hence, we are using session_start in our SQL scripts.

    • _TABLE_SUFFIX: This is used for choosing from sharded tables whose name suffixes fall (e.g., events_20240301) within a given date range.

    • Be aware that sharded tables are generated automatically once BigQuery finds tables that share the following characteristics:

      • Exist in the same dataset
      • Have the exact same table schema
      • The same prefix
      • Have a suffix of the form _YYYYMMDD (eg. 20240501)
  • Channel Grouping: The channel grouping here is an example. One can define their own channel grouping based on their business strategy. It is implemented in the section add_channel.

  • event_name: We are only interested in records where event_name = 'session_start' or event_name = 'purchase'.

  • At the prep_purchase CTE, we rename the event as channel, this is on purpose so we can use this result directly in our python script for data transformation.

  • ecommerce.purchase_revenue_in_usd: Purchase revenue dimension collected with the event (only for purchase events).

Finally, the generated table looks like this:

Output Preview

The post-processing of this table will produce a final dataset suitable to be used in the ChannelAttribution Pro library interface.