Google Analytics in BigQuery, explained in one query

Google Analytics (GA) is a popular suite of analytic tools used by many companies to track customer interactions on their digital channels. Although it offers plenty of built-in capabilities for insights discovery, there are times when you want to deep dive and run your own analyses. This post will help you understand the Google Analytics data that is exported to BigQuery and how to extract the key pieces of information that you need.

Understanding the data structure

  • BigQuery stores the exported data by date, and each day is stored in its own table. For instance yesterday’s data will be stored in 1300.ga_sessions_20170315, where 1300 is the project id and 20170315 is the date in yyyymmdd format. Data of the current date is stored in an intraday table, e.g. 1300.ga_sessions_intraday_20170315.
  • Each table contains all sessions by users, one row per user session. A session is simply a sequence of pages viewed by the user (or in GA terminology, page hits).

For analytical tasks, we want to be able to identify users and sessions.

Identifying unique users

Users can be divided into two categories: logged in and not logged in (guests), of which only the former can be reliably identified. Logged in users can be associated with customers if you set and send these programmatically, either via the userId field or some custom dimensions that you define. Guests can be identified via fullVisitorIds but these are reset if user clear their cookies or use multiple devices. In fact the mapping from userIds and fullVisitorIds are N-to-N, so they can’t be reliably linked.

Take-away message: set, send, and use userId to uniquely identify customers.

Identifying unique sessions

The documentation from GA recommends using fullVisitorId + visitId to get a globally unique session identifier (within your GA data source). But for logged in users, we should actually use userId + visitStartTime to identify sessions of each user, where visitStartTime is the start time of a session. Let me illustrate with a toy example:

visitId | fullVisitorId | userId | visitStartTime
v1          | f1                     | u1         | 1000000
v1          | f2                     | u1         | 1500000

Here we have one user u1, who is mapped to two different visitor ids in two different sessions. The visitIds happen to be the same in both sessions. So using userId + visitId we would get only one session where in fact there are two. Using userId with visitTime is the right combination as an user can’t have two sessions that start at exactly the same time. If we want to be 100% certain that they are unique, we can use userId + visitId + visitStartTime.

Example reference query

Now that we know how to identify users and sessions, let me give you one reference query that covers the main concepts you need to know to work with this data.

Note that BigQuery is compliant with the SQL 2011 standard and supports complex types like Arrays and Structs. It also supports legacy SQL, but I encourage the use of the Standard SQL dialect, as in the given query below.

select 
  b.account_number
  , concat(cast(visitId as string), '_', cast(visitStartTime as string)) session_id
  , hits.type as hit_type
  , hits.hitNumber as hit_number
  , concat(hits.page.pagePathLevel1, 
      regexp_replace(hits.page.pagePathLevel2, '\\?.*$', ''))
    ) as page_level12
  , hits.appInfo.screenName
 from `project_id.ga_sessions_201702*` s, s.customdimensions as custdim,
       s.hits, `project_id.account_numbers` b
 where custdim.index = 1 and 
   custdim.value = cast(b.account_number as string) and
   timestamp_diff(timestamp_seconds(visitStartTime), b.first_online, HOUR) < 24
   and hits.type in ('APPVIEW', 'PAGE');

The query extracts all pages visited by each user on apps and websites, within their first  day online. Here are the main points:

  • account_number is used in place of userId for logged in users. This comes from an external data source, for example your customer database table. This is set and send to GA via the first custom dimension, which we retrieve with the condition custdim.index = 1.
  • unique session identifier is given by concatenating visitId and visitStartTime, as discussed above.
  • hits is an array of struct containing information about each hit / page view. There are several different hit types but here we limit to ‘APPVIEW’ for app interactions and ‘PAGE’ for website interactions
  • the sessions table is implicitly joined with its column hits to flatten the table (getting one row per record)
  • hits.hit_number gives us the order of page views within a session
  • wildcard is used to qualify against the tables (hence dates) we query, here we are looking at data in Feb 2017 only
  • hits.page.pagePathLevel{1 to 4} gives the web page, and hits.appInfo.screenName gives the app page
  • timestamp_diff, timestamp_seconds are date time functions

References:

Side note

BigQuery web interface is not yet fully fledged. In fact, I find it quite limiting at first as it does not allow creating tables directly from query and has only one window for writing query. But auto-completion (which works for table names, columns and functions) and pop-up documentation are absolutely two killer features. I like it much better in this respect compared to the boring SQLWorkbench/J client that I’ve been using.

Recommended books for data scientists

In this post I’d like to share some of my recommended books for learning data science and machine learning, both in theory and and practice. Fellow practitioners, let me know your favourite books or any other related resources, I’d be keen to check out some new books and add them to my list.

Theory

These are all foundational textbooks in machine learning. If you study at least one of them in depth, by which I mean formulating models, deriving and implementing the main inference algorithms, and doing the exercises. The books can be quite technical if you’re new to machine learning, but once you stick through one, you’ll find others quite accessible.

The Elements of Statistical Learning (ESL), by Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie
One of the classics, there’s also an online course and a new textbook accompanied by R code.

Pattern recognition and machine learning (PRML), by Christopher Bishop

Similar to ESL, this highly regarded book is another must-read.

Machine Learning: A Probabilistic Perspective, by Kevin R Murphy

If you study PRML thoroughly, you’ll be familiar with most contents in Murphy’s book. Nevertheless a fun and comprehensive book with a strong focus on principled, probabilistic approach to modelling. It also comes with code in Matlab.

Probabilistic Graphical Models, Daphne Koller and Nir Friedman

Graphical models provide a framework for representation, inference, as well as learning of probabilistic models. This powerful framework provides a unifying view to many ML models which otherwise may be viewed as just a bunch of disparate models. There’s also an online course on Coursera. 

Reinforcement learning, an introduction, by Richard S. Sutto and Andrew G. Barto

Despite still a draft, the second release is well-written and motivates the concepts and applications of RL really well.

Neural networks and deep learning, by Michael Nielsen
Deep Learning
, Ian Goodfellow and Yoshua Bengio and Aaron Courville

Michael Nielsen’s book is more hands-on and contains some cool interactive contents to aid understanding, while Goodfellow et al is more comprehensive. I recommend reading them in the given order.

*Gaussian processes for machine learning, by Carl E. Rasmussen and Christopher K. I. Williams

*The last book is on Gaussian processes, my PhD research topic. You don’t really need to know it for doing practical data science, but still a good reference. The first few chapters present the Bayesian approach to modelling and are worth reading.

Practice

Data science for business, by Foster Provost and Tom Fawcett

This book is accessible to non-technical audience like business managers. It also provides some sound principles on how to execute data science projects. Highly recommended.

Applied predictive modelling, by Kjell Johnson and Max Kuhn

Written by the author of the popular R package caret, this is a must-read for those practising data science. It contains many practical tricks and advices.

 

Data Mining Techniques: For Marketing, Sales, and Customer Relationship, by Gordon S. Linoff and Michael J. A. Berry

Don’t let the title mislead you, this is a good read on data science techniques in general, not just in CRM space.

Data preparation for data mining, by Dorian Pyle

Published in 1999 but still very relevant a day, this book can serves as a good checklist of things to inspect when preparing data for analysis.

Bandit algorithms for website optimization, by John Myles White

This book presents standard multi-armed bandit algorithms and comes with implementations in several languages.

Practical data science with R, John Mount and Nina Zumel

Not as polished as Johnson and Kuhn’s book but has few neat techniques worth knowing.