Make the most of your Google Analytics data with Orchest and Meltano

Google Analytics is a widely used solution for tracking the activity of new and returning visitors to your website or mobile app. The most recent version, Google Analytics 4 (GA4), has an event-based model that can measure many different kinds of data.

In this blog post, we will walk you through how to export the raw events generated by Google Analytics 4 to your data warehouse, using Orchest for orchestration, Meltano for Extraction & Loading (EL), and Metabase for visualization.

Why?

Google Analytics 4 already offers visualization capabilities through its predefined reports, and you can create custom reports too. However, you might want to use a custom visualization tool, or combine the GA4 data with some other data sources you have.

Google Analytics properties have native integration with Google BigQuery, a cloud data warehouse that lets you run highly performant queries over large datasets. This makes it easy to export your raw event data for further manipulation.

Note: Alternatively, you could use the Google Analytics Data API v1 (still in beta) to gain programmatic access to GA4 report data. However, in this blog post we will focus on getting access to the raw event data.

There is a chance that you are already using another cloud data warehouse (AWS Redshift, Snowflake) or that you don’t want to use Google BigQuery for your data analysis. Orchest will allow you to orchestrate your EL pipelines so that you can copy your data from BigQuery to the destination that best suits your needs.

Several solutions exist for data ingestion, like Fivetran, Keboola, Airbyte, and Meltano. The advantage of Meltano is that it is open-source (unlike the first two) and leverages the Singer specification and ecosystem of open-source connectors (unlike Airbyte).

Requirements

  • A Google Cloud Platform account
  • (Optional) A Google Analytics 4 property of your own

Google Cloud will serve as the source of raw GA4 data, which you will then extract and load to your warehouse using Orchest. The architecture will look as follows:

1. Get your Google Analytics 4 data in Google BigQuery

BigQuery, part of Google Cloud Platform, has a free tier that includes 1 TB of querying per month and 10 GB of storage each month. In addition, standard Google Analytics 4 properties have a daily BigQuery Export limit of 1 million events. If your site has low traffic, most likely you won’t have to worry about costs!

As a first step, create a Google Cloud project that will contain all the raw event data. For the rest of the blog post, we will assume that the project ID is orchest-ga4-tutorial. Next:

  • If you don’t have your own Google Analytics 4 property, use this public dataset, which contains an obfuscated sample of an e-commerce site.

For demonstration purposes, we will use the public dataset.

You will need to generate proper credentials to access the data from the outside. For that, head to “IAM & Admin” → “Service accounts”, click on “+ Create service account”, and enter the following information:

  • Service account name: tap-bigquery
  • Roles: “BigQuery Data Viewer”, “BigQuery Job User”

Click “Done”, then under “Actions” for the service account you just created select “Manage keys”, click “Add key” → “Create new key”, leave “JSON” selected, and download the resulting orchest-ga4-tutorial-xxxx.json. You will need this file later in your Orchest project.
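The downloaded key is a plain JSON file, so if you want to double-check its contents before using it, a few lines of Python will do. This is a minimal sketch; the filename below is the placeholder used throughout this post, so substitute your actual key file:

-- CODE language-python --
import json

# Inspect the downloaded service account key (use the actual filename of your key).
with open("orchest-ga4-tutorial-xxxx.json") as f:
    key = json.load(f)

print(key["type"])          # should be "service_account"
print(key["project_id"])    # should be orchest-ga4-tutorial
print(key["client_email"])  # tap-bigquery@<project-id>.iam.gserviceaccount.com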

2. Create your Orchest project and define the requirements

Now it’s time to go to your Orchest instance! Orchest will be responsible for orchestrating the extraction and load of the data from BigQuery to a different database for closer inspection. The destination of the data will be a PostgreSQL database configured as an Orchest service.

To set everything up, proceed as follows:

  1. Create a new Project called “ga4-orchest”.
  2. Wait for the default Environment to build.
  3. Create a new pipeline called “Main”.
  4. Get into JupyterLab and add a requirements.txt file with these contents:

-- CODE language-plain --
# requirements.txt
meltano~=2.4.0
sqlalchemy[postgresql]~=1.4.40

  5. Go back to the Environment and modify your setup script to look like this:

-- CODE language-bash --
#!/bin/bash

sudo apt-get update
sudo -E apt-get install -y --no-install-recommends libpq-dev build-essential

pip install -r requirements.txt

  6. Add a new Service with the following configuration:
     • Name: postgres
     • Image: postgres
     • Ports: 5432
     • Environment variables: POSTGRES_HOST_AUTH_METHOD=trust, POSTGRES_DB=warehouse, PGDATA=/var/lib/postgresql/data/pgdata
     • Data directory: /var/lib/postgresql/data
  7. Create a /data/ga4-orchest directory and upload the orchest-ga4-tutorial-xxxx.json key you downloaded earlier. This way you make sure you are not versioning sensitive data.
  8. Finally, add a new environment variable to your project named GOOGLE_APPLICATION_CREDENTIALS containing the path of the JSON key (a quick check of this variable is shown below).

Note: For production usage, we recommend running your databases on dedicated infrastructure, rather than relying on Orchest mount mechanisms. Read more in the documentation.

And you’re all set to start configuring the extraction and loading process. Remember to commit your changes!
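Optionally, before moving on, you can check from a JupyterLab console that the GOOGLE_APPLICATION_CREDENTIALS variable points to the key you uploaded. A minimal sketch:

-- CODE language-python --
import os

# The variable was set at the project level, so it is visible to JupyterLab kernels.
key_path = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
assert os.path.isfile(key_path), f"No key file found at {key_path}"
print(f"Using credentials at {key_path}")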

3. Initialize the Meltano project

Before going any further, you will need to initialize the Meltano project. To do that, first launch a new Console (not Terminal!) from JupyterLab, and verify the current working directory:

-- CODE language-plain --
[1]: %pwd
[1]: '/project-dir'

Then, initialize the Meltano project and change the working directory accordingly:

-- CODE language-plain --
[2]: !meltano init meltano
...

[3]: %cd meltano

[4]: %pwd
[4]: '/project-dir/meltano'

Now add the tap-bigquery extractor to the project configuration, verify that it got installed correctly, and configure an appropriate start_datetime value (just before the date of the sample data, which is from January 2021):

-- CODE language-plain --
[5]: !meltano add extractor tap-bigquery
...

[6]: !meltano invoke tap-bigquery --help
2022-08-22T09:38:49.451413Z [info     ] Environment 'dev' is active
usage: tap-bigquery [-h] -c CONFIG [-s STATE] [-p PROPERTIES]
                   [--catalog CATALOG] [-d] [--start_datetime START_DATETIME]
                   [--end_datetime END_DATETIME]
...

[7]: !meltano config tap-bigquery set start_datetime 2021-01-30T00:00:00Z
...

So far, this is how your meltano.yml file under the meltano directory should look:

-- CODE language-yaml --
version: 1
default_environment: dev
environments:
- name: dev
- name: staging
- name: prod
plugins:
  extractors:
  - name: tap-bigquery
    variant: anelendata
    pip_url: tap-bigquery
    config:
      start_datetime: '2021-01-30T00:00:00Z'

Project files in Orchest are tracked under version control using git. This is a good moment to checkpoint your progress by making a commit, either using the JupyterLab Git UI or the integrated terminal.

4. Configure the Meltano extraction

Now it’s time to tell Meltano which streams to extract from BigQuery. Before that, you will need to create a view on BigQuery to make the data more amenable to Meltano. To do so, go to the BigQuery editor and type the following query:

-- CODE language-sql --
SELECT
 event_timestamp AS event_timestamp_int
 , CAST(TIMESTAMP_MICROS(event_timestamp) AS timestamp) AS `event_timestamp`
 , EXTRACT(
     DATE
     FROM CAST(TIMESTAMP_MICROS(event_timestamp) AS timestamp)
     AT TIME ZONE 'UTC'  -- Assume UTC timezone for reporting in Google Analytics
   ) AS `reporting_day`
 , event_name
 , event_dimensions.hostname AS `hostname`
 , event_params
 , traffic_source.name AS `traffic_source_name`
 , traffic_source.medium AS `traffic_source_medium`
 , traffic_source.source AS `traffic_source_source`
-- You can group all tables like this,
-- but bear in mind that running the query will process 1.5 GB of data!
-- FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131`
ORDER BY event_timestamp_int

Save the query as a View named summarized_events_20210131 on a new dataset ga4_data_sample with location US (multi-region).
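If you want to confirm that the service account you created earlier can actually read the new view, you can query it from your Orchest environment before wiring it into Meltano. The snippet below is a minimal sketch: it assumes you additionally pip install google-cloud-bigquery (not part of this tutorial’s requirements.txt) and that GOOGLE_APPLICATION_CREDENTIALS is set as described in step 2:

-- CODE language-python --
from google.cloud import bigquery  # extra dependency: pip install google-cloud-bigquery

# The client picks up the service account key from GOOGLE_APPLICATION_CREDENTIALS.
client = bigquery.Client(project="orchest-ga4-tutorial")

query = """
SELECT event_name, hostname, reporting_day
FROM `orchest-ga4-tutorial.ga4_data_sample.summarized_events_20210131`
LIMIT 5
"""

# Print a handful of rows to verify permissions and the view definition.
for row in client.query(query).result():
    print(row["event_name"], row["hostname"], row["reporting_day"])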

Next, add the following contents under the tap-bigquery config key in meltano.yml:

-- CODE language-yaml --
streams:
- name: ga4_obfuscated_sample_ecommerce_20210131
  table: orchest-ga4-tutorial.ga4_data_sample.summarized_events_20210131
  columns:
  - event_timestamp
  - event_name
  - hostname
  - event_params
  - traffic_source_name
  - traffic_source_medium
  - traffic_source_source
  datetime_key: event_timestamp

To verify that the authentication and the view worked, display the list of available entities from the stream:

-- CODE language-plain --
[8]: !meltano select tap-bigquery --list --all
2022-08-23T09:47:30.617376Z [info     ] Environment 'dev' is active
Legend:
selected
excluded
automatic

Enabled patterns:
*.*

Selected attributes:
[automatic] ga4_obfuscated_sample_ecommerce_20210131._etl_tstamp
[automatic] ga4_obfuscated_sample_ecommerce_20210131._sdc_batched_at
[automatic] ga4_obfuscated_sample_ecommerce_20210131._sdc_extracted_at
[automatic] ga4_obfuscated_sample_ecommerce_20210131.event_name
[automatic] ga4_obfuscated_sample_ecommerce_20210131.event_params
[automatic] ga4_obfuscated_sample_ecommerce_20210131.event_timestamp
[automatic] ga4_obfuscated_sample_ecommerce_20210131.hostname
[automatic] ga4_obfuscated_sample_ecommerce_20210131.traffic_source_medium
[automatic] ga4_obfuscated_sample_ecommerce_20210131.traffic_source_name
[automatic] ga4_obfuscated_sample_ecommerce_20210131.traffic_source_source

Meltano correctly detected the column names! 🎉

At this point, since you already filtered the columns using the BigQuery view and the configuration of the stream, you can select all the entities:

-- CODE language-plain --
[9]: !meltano select tap-bigquery --all

Your meltano.yml configuration will look like this:

-- CODE language-yaml --
version: 1
default_environment: dev
environments:
- name: dev
  config:
    plugins:
      extractors:
      - name: tap-bigquery
        select:
        - '*.*'
- name: staging
- name: prod
plugins:
  extractors:
  - name: tap-bigquery
    variant: anelendata
    pip_url: tap-bigquery
    config:
      start_datetime: '2021-01-30T00:00:00Z'
      streams:
      - name: ga4_obfuscated_sample_ecommerce_20210131
        table: orchest-ga4-tutorial.ga4_data_sample.summarized_events_20210131
        columns:
        - event_timestamp
        - event_name
        - hostname
        - event_params
        - traffic_source_name
        - traffic_source_medium
        - traffic_source_source
        datetime_key: event_timestamp

Commit your progress so far; just one more step to go!

5. Load the data to the warehouse

The last step is instructing Meltano to load the GA4 data into our “warehouse” (our PostgreSQL service). To that end, first add the target-postgres loader:

-- CODE language-plain --
[10]: !meltano add loader target-postgres
...

[11]: !meltano invoke target-postgres --help
2022-08-23T10:08:28.737588Z [info     ] Environment 'dev' is active
usage: target-postgres [-h] [-c CONFIG]

optional arguments:
 -h, --help            show this help message and exit
 -c CONFIG, --config CONFIG
                       Config file

Next, set some basic configuration values for the loader using the Meltano CLI:

-- CODE language-plain --
[12]: !meltano config target-postgres set password '_np_'
!meltano config target-postgres set user postgres
!meltano config target-postgres set dbname warehouse
!meltano config target-postgres set default_target_schema public
!meltano config target-postgres set primary_key_required false
...

Now use the Orchest SDK to extract the internal hostname of the PostgreSQL service and pass that to the Meltano CLI as well:

-- CODE language-plain --
[13]: from orchest import get_service

[14]: postgres_host = get_service("postgres")["internal_hostname"]

[15]: !meltano config target-postgres set host $postgres_host
...

And finally, verify that the EL pipeline works as expected:

-- CODE language-plain --
[16]: !meltano run --no-state-update tap-bigquery target-postgres
2022-08-23T10:16:23.634664Z [info     ] Environment 'dev' is active
2022-08-23T10:16:26.243599Z [warning  ] No active environment, proceeding with stateless run! See https://docs.meltano.com/reference/command-line-interface#run for details.
2022-08-23T10:16:26.664659Z [info     ] time=2022-08-23 10:16:26 name=target_postgres level=INFO message=Table '"ga4_obfuscated_sample_ecommerce_20210131"' does not exist. Creating...
...
2022-08-23T10:16:49.876486Z [info     ] Block run completed.           block_type=ExtractLoadBlocks err=None set_number=0 success=True

Meltano properly extracted the data from BigQuery and loaded it into PostgreSQL 🎉 It’s an excellent moment to commit your progress!
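If you want to see the loaded rows with your own eyes, you can query the PostgreSQL service directly from the same console. This is a minimal sketch using SQLAlchemy (already pinned in requirements.txt); the table name comes from the stream you configured earlier:

-- CODE language-python --
from orchest import get_service
from sqlalchemy import create_engine, text

# The postgres service uses trust authentication, so no password is needed.
host = get_service("postgres")["internal_hostname"]
engine = create_engine(f"postgresql://postgres@{host}:5432/warehouse")

with engine.connect() as conn:
    n_rows = conn.execute(
        text("SELECT COUNT(*) FROM ga4_obfuscated_sample_ecommerce_20210131")
    ).scalar()

print(f"Rows loaded by Meltano: {n_rows}")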

6. Build the Orchest pipeline

Now that you have the whole thing working from the console, it’s time to create a pipeline that will take care of running the commands on a schedule. First, add a step that installs the necessary Meltano plugins if needed:

-- CODE language-bash --
#!/usr/bin/env bash
# setup.sh
# Linted with https://www.shellcheck.net/

pushd meltano || exit

# Recreates plugins from meltano.yml
meltano install

# Verify the state of the pipelines
meltano state list

Second, add a pipeline step that runs the EL:

-- CODE language-bash --
#!/usr/bin/env bash
# ga4_el.sh
# Linted with https://www.shellcheck.net/

pushd meltano || exit

meltano run tap-bigquery target-postgres

This is what your pipeline will look like:

Extra: Visualizing data with Metabase

If you made it this far, why not do a bit of extra legwork to visualize the data you just loaded? For example, you can add a Metabase service with the following configuration:

  • Name: metabase
  • Image: metabase/metabase:v0.44.1
  • Ports: 3000
  • Environment variables: MB_DB_FILE=/root/metabase.db
  • Data directory: /root

Upon first run, you will have to enter the hostname and other configuration values of the PostgreSQL service. After that, you can start exploring the data straight away! This is what a sample dashboard could look like:

The limit is your imagination!

Orchest is a pipeline orchestrator with first-class support for Meltano and other open-source MLOps tools. Import this example project, which contains the Jupyter notebooks with the code from this article, into your Orchest Cloud instance.

Still no Orchest account? Get started for free!