top of page
Search

GA4 and BigQuery: how to match the Console traffic grouping metrics

Google Analytics 4 offers excellent functionality with its GA4 BigQuery dataset. However, you might find it challenging to align your GA4 UI figures with the unnested sources, mediums, and campaigns in BigQuery.


So what happens in BigQuery compared to GA4?

When data is generated, it’s sent directly to BigQuery as raw data without any pre-processing. However, once it’s in GA4, Google applies various models and transformations. This includes joining data, modelling session traffic, adding cost data, and replacing ISO2 codes with language names.


Matching up your numbers

If you unnest the data in GA4 to extract the session source, you’ll notice significant discrepancies between your GA4 UI and GA4 BigQuery data, particularly if you have extensive paid and social activity.

The sample code below displays the raw traffic source, resulting in more users appearing as ‘direct’ or ‘organic’ than in the UI. This discrepancy arises because the data is unprocessed.



SELECT
DISTINCT
(select value.string_value from unnest(event_params) wherekey = 'source')  AS source, 
(select value.string_value from unnest(event_params) wherekey = 'medium')  AS medium,
count(distinct(concat(user_pseudo_id,'.',
(select cast(value.int_value asstring) from unnest(event_params) wherekey = 'ga_session_id')))) AS session_count
FROM `analytics_123456789.events_20230831` 
WHERE event_name = 'page_view'
AND (select value.int_value from unnest(event_params) wherekey = 'entrances') =1 
GROUPBY1,2
ORDERBY3 DESC
 ;

To mirror the GA4 User Interface, you’ll need to further customize this code. This is due to Google’s handling of direct and organic traffic for sessions in the UI. Specifically, if a user comes from a direct source, the system will use the user’s traffic source.

This observation is based on my extensive testing and data analysis. By making this adjustment, the figures align more closely with the GA4 UI, clarifying why CPC mediums might appear much lower in GA4 BigQuery compared to the GA4 UI.



SELECT DISTINCT
    CASE
        WHEN (SELECT value.string_value FROM unnest(event_params) WHERE key = 'medium') IS NULL THEN traffic_source.source
        WHEN (SELECT value.string_value FROM unnest(event_params) WHERE key = 'medium') = 'organic'
             AND regexp_contains(traffic_source.medium, '(cpc|paid|social|email|referral)') THEN traffic_source.source
        ELSE (SELECT value.string_value FROM unnest(event_params) WHERE key = 'source')
    END AS source,
    
    CASE
        WHEN (SELECT value.string_value FROM unnest(event_params) WHERE key = 'medium') IS NULL THEN traffic_source.medium
        WHEN (SELECT value.string_value FROM unnest(event_params) WHERE key = 'medium') = 'organic'
             AND regexp_contains(traffic_source.medium, '(cpc|paid|social|email|referral)') THEN traffic_source.medium
        ELSE (SELECT value.string_value FROM unnest(event_params) WHERE key = 'medium')
    END AS medium,
    
    COUNT(DISTINCT CONCAT(user_pseudo_id, '.', 
                          (SELECT CAST(value.int_value AS STRING) 
                           FROM unnest(event_params) WHERE key = 'ga_session_id'))) AS session_count
    
FROM `analytics_123456789.events_20230831`
WHERE event_name = 'page_view'
AND (SELECT value.int_value FROM unnest(event_params) WHERE key = 'entrances') = 1
-- Uncomment the lines below if required
-- AND (SELECT value.string_value FROM unnest(event_params) WHERE key = 'medium') IS NOT NULL
-- AND (SELECT value.string_value FROM unnest(event_params) WHERE key = 'medium') <> 'organic'
GROUP BY 1, 2
ORDER BY 3 DESC;

An important point to mention is the joining of session_id with user_pseudo_id. I’ve done this because session_id isn’t unique on its own, so pairing it with user_pseudo_id addresses this issue. Even after this adjustment, you’ll observe some discrepancies, though they’re more aligned with the GA4 UI. These discrepancies arise from various reasons, including:

  • Googles data driven models: https://support.google.com/google-ads/answer/6394265?hl=en-GB

  • The use of approx. count in the UI: https://developers.google.com/analytics/blog/2022/hll#hll_implementation_in_google_analytics_4_properties

  • The reporting ID you have set: https://support.google.com/analytics/answer/10976610?hl=en

  • Timezone differences, by default the event_timestamp is in UTC on your BigQuery dataset.

It’s essential to consider several factors. While modelling your data to align with the GA4 UI is advantageous for reporting, there’s value in understanding the genuine origins of your sessions without any modelling.


For instance, if a user session is from a direct source, it might indicate that users have bookmarked or favourited your website and are potentially more engaged. Always keep in mind the specific scope of your analysis and reporting, and choose the approach that delivers the data you need.


This summary is based on my research, and I welcome any insights or discoveries you might have regarding GA4 UI replication. My next post will be how to replicate the default channel grouping consistently on your GA4 session and traffic attribution.


41 views0 comments

Comments


bottom of page