How to Sync Simple Analytics Data to BigQuery using GitHub Actions
Introduction
In this how-to guide, we will show you how you can load page views and events from Simple Analytics into BigQuery on regular schedule. We will use GitHub Actions to automate the process. This is a simple, reliable (and potentially free!) way to load Simple Analytics data into BigQuery on a regular basis.
Prerequisites
Before we start, to follow this guide you will need the following:
- a Simple Analytics account to load data from,
- a BigQuery dataset to sync to,
- a GitHub repository to store the workflow and config files, and
- the
gcloud
CLI (opens in a new tab) installed and configured. You may also choose to use the Google Cloud Console UI (or Terraform) instead, but we won't cover the steps for that here.
Now, let's get started!
Step 1: Add Workflow Configuration to the GitHub repository
Step 1.1: Add the workflow file
Inside your git repository, create a new file called .github/workflows/sync_simple_analytics_to_bigquery.yml
and add the following content:
name: Sync Simple Analytics to BigQuery
on:
workflow_dispatch:
schedule:
- cron: "0 2 * * *" # daily at 02:00 UTC
jobs:
simple-analytics-to-bigquery:
timeout-minutes: 60
runs-on: ubuntu-latest
permissions:
id-token: 'write' # This required for OIDC
contents: 'read' # This is required for actions/checkout@v3
steps:
- uses: actions/checkout@v3
with:
fetch-depth: 2
- name: Authenticate to Google Cloud
uses: 'google-github-actions/auth@v0'
with:
workload_identity_provider: 'projects/${{ secrets.SA_BIGQUERY_PROJECT_NUMBER }}/locations/global/workloadIdentityPools/cloudquery-pool/providers/cloudquery-provider'
service_account: 'cloudquery-service-account@${{ secrets.SA_BIGQUERY_PROJECT_ID }}.iam.gserviceaccount.com'
- uses: cloudquery/setup-cloudquery@v3
name: Setup CloudQuery
with:
version: "v2.3.1"
- name: CloudQuery Sync
run: cloudquery sync --log-console simple-analytics.yml bigquery.yml
env:
SA_USER_ID: ${{ secrets.SA_USER_ID }}
SA_API_KEY: ${{ secrets.SA_API_KEY }}
SA_BIGQUERY_PROJECT_ID: ${{ secrets.SA_BIGQUERY_PROJECT_ID }}
SA_BIGQUERY_DATASET_ID: ${{ secrets.SA_BIGQUERY_DATASET_ID }}
This configuration allows us to run the workflow manually or on a schedule. The example above is configured to run daily at 2 a.m. UTC. The workflow will run on a self-hosted runner, which is a virtual machine hosted by GitHub. The workflow will run the following steps:
- Check out the repository
- Authenticate to Google Cloud using the google-github-actions/auth (opens in a new tab) Action
- Setup CloudQuery using the cloudquery/setup-cloudquery (opens in a new tab) Action
- Run the
cloudquery sync
command to load data from Simple Analytics into BigQuery
We will now configure the secrets and the configuration files.
Step 1.2: Add the secrets
In the GitHub repository, go to Settings > Secrets
and add the following repository secrets:
SA_USER_ID
: the user ID of your Simple Analytics account. You can find this in the Simple Analytics Account Settings (opens in a new tab).SA_API_KEY
: the API key of your Simple Analytics account. You can find this in the Simple Analytics Account Settings (opens in a new tab).SA_BIGQUERY_PROJECT_ID
: the ID of the Google Cloud project where your BigQuery dataset is located. You can find this in the Google Cloud Console home page when your project is selected.SA_BIGQUERY_DATASET_ID
: the ID of the BigQuery dataset where you want to load the data. You can find this in the BigQuery UI.SA_BIGQUERY_PROJECT_NUMBER
: the number of the Google Cloud project where your BigQuery dataset is located. You can find this in the Google Cloud Console home page when your project is selected.
Step 1.3: Add the Simple Analytics source plugin configuration file
In the git repository, create a new file called simple-analytics.yml
and add the following content:
kind: source
spec:
name: "simple-analytics"
path: "simpleanalytics/simple-analytics"
version: "v1.0.0"
tables:
["*"]
destinations:
- "bigquery"
spec:
user_id: ${SA_USER_ID}
api_key: ${SA_API_KEY}
websites:
- hostname: <your-website-hostname>
You may find the latest version on the releases page. Replace <your-website-hostname>
with the hostname of your website as defined on Simple Analytics, e.g. mywebsite.com
. You can also add more than one website, specify additional metadata that should be fetched or a period of days to fetch data for. See the Simple Analytics source plugin documentation for all the details.
Step 1.4: Add the BigQuery destination plugin configuration file
Now, let's set up the BigQuery destination. In the git repository, create a new file called bigquery.yml
and add the following content:
kind: destination
spec:
name: bigquery
path: cloudquery/bigquery
version: "v2.1.2"
write_mode: "append"
spec:
project_id: ${SA_BIGQUERY_PROJECT_ID}
dataset_id: ${SA_BIGQUERY_DATASET_ID}
You can find the latest version in the BigQuery destination plugin documentation (opens in a new tab). There are also a few more configuration options you may want to consider, like partitioning based on sync time.
Step 1.5: Push to the repository
With all this in place, make sure you to push the files to the GitHub repository. You can now run the workflow manually by going to Actions > Sync Simple Analytics to BigQuery > Run workflow
and clicking the Run workflow
button. It will run, but probably end with an error, because we still need to set up the Workload Identity to authenticate with BigQuery. Let's do that next.
Step 2: Set up Workload Identity
In this section, we will set up Workload Identity to allow the GitHub Actions runner to authenticate with BigQuery. This is required because the GitHub Actions runner is a self-hosted runner, which is not part of the Google Cloud project. We will use the google-github-actions/auth (opens in a new tab) and the steps in this guide are based on the instructions there, but fine-tuned for our use case.
Step 2.1: Set Environment Variables
First, let's set a few environment variables that we'll use throughout the guide. Set the following variables, replacing <your-project-id>
and <your-repo>
with your own values:
export PROJECT_ID=<your-project-id> # The ID of your Google Cloud project, e.g. "my-project"
export REPO=<your-repo> # e.g. cloudquery/simple-analytics-to-bigquery
Step 2.2: Create a Service Account
We will need a service account that can write to BigQuery. We will use the gcloud
CLI to create the service account and grant it the necessary permissions. Run the following commands:
gcloud iam service-accounts create "cloudquery-service-account" \
--project "${PROJECT_ID}"
If you haven't used the IAM Credentials API in the project before, it will need to be enabled:
gcloud services enable iamcredentials.googleapis.com \
--project "${PROJECT_ID}"
Now we will grant the Google Cloud Service Account permissions to access Google Cloud resources. In this case, BigQuery:
We will assign the service account the BigQuery Data Editor
role:
gcloud projects add-iam-policy-binding $PROJECT_ID --member="serviceAccount:cloudquery-service-account@${PROJECT_ID}.iam.gserviceaccount.com" --role=roles/bigquery.dataEditor
Step 2.3: Create a Workload Identity Pool and Provider
We will now set up Workload Identity, which the GitHub action will rely on to authenticate itself without the need for keys. You can read more about keyless authentication with GCP and GitHub Actions here (opens in a new tab).
First we need to create a pool:
gcloud iam workload-identity-pools create "cloudquery-pool" \
--project="${PROJECT_ID}" \
--location="global" \
--display-name="Cloudquery pool"
Now let's describe the pool and set an environment variable that we'll use in subsequent steps:
export WORKLOAD_IDENTITY_POOL_ID=$(gcloud iam workload-identity-pools describe "cloudquery-pool" \
--project="${PROJECT_ID}" \
--location="global" \
--format="value(name)")
We can use echo $WORKLOAD_IDENTITY_POOL_ID
to verify that the variable is set correctly:
echo $WORKLOAD_IDENTITY_POOL_ID
# projects/<your-project-id>/locations/global/workloadIdentityPools/cloudquery-pool
Now let's create an OIDC provider:
gcloud iam workload-identity-pools providers create-oidc "cloudquery-provider" \
--project="${PROJECT_ID}" \
--location="global" \
--workload-identity-pool="cloudquery-pool" \
--display-name="Cloudquery provider" \
--attribute-mapping="google.subject=assertion.sub,attribute.actor=assertion.actor,attribute.repository=assertion.repository" \
--issuer-uri="https://token.actions.githubusercontent.com"
Finally, we need to grant the service account the Workload Identity User
role:
gcloud iam service-accounts add-iam-policy-binding "cloudquery-service-account@${PROJECT_ID}.iam.gserviceaccount.com" \
--project="${PROJECT_ID}" \
--role="roles/iam.workloadIdentityUser" \
--member="principalSet://iam.googleapis.com/${WORKLOAD_IDENTITY_POOL_ID}/attribute.repository/${REPO}"
Step 3: Query the data!
Now that we have set up the Workload Identity, we can run the workflow again. Go to Actions > Sync Simple Analytics to BigQuery > Run workflow
and click the Run workflow
button. You should see the workflow run successfully. You can also check the logs to see that the data was successfully written to BigQuery, or get messages about any errors that occurred. After a successful workflow run you should be able to query the data in BigQuery!
Let's head over to the BigQuery console and run the following query to test it out:
SELECT * FROM `cq-analytics-v1.simple_analytics.simple_analytics_page_views` LIMIT 10
Nice, we have raw Simple Analytics data in BigQuery! We can now use this data to build dashboards, reports, and more. The GitHub Action will run daily, so the data will be kept up to date automatically.
Step 4: Post-processing the data (Optional)
Now that we have extracted and loaded our data into BigQuery, we may also want to do the "T" part of ELT: transformation as a post-processing step. For example, because the BigQuery destination only supports append
mode, we should make sure that our query results don't contain duplicates. One simple way to do this is to run a SQL to remove duplicates after the sync completes. We can do this by adding a few additional steps to the workflow that run deduplication queries as SQL.
Let's first write the queries. First we'll create deduplicate_page_views.sql
:
CREATE
OR REPLACE TABLE `simple_analytics`.`simple_analytics_page_views` AS
SELECT
*
FROM
`simple_analytics`.`simple_analytics_page_views`
WHERE
_cq_sync_time = (
SELECT
max(_cq_sync_time)
FROM
`simple_analytics`.`simple_analytics_page_views`
);
And similarly, we'll create deduplicate_events.sql
:
CREATE
OR REPLACE TABLE `simple_analytics`.`simple_analytics_events` AS
SELECT
*
FROM
`simple_analytics`.`simple_analytics_events`
WHERE
_cq_sync_time = (
SELECT
max(_cq_sync_time)
FROM
`simple_analytics`.`simple_analytics_events`
);
Now we can add these queries to the workflow. We'll add a few steps to the workflow to install the bq
tool and run the queries:
- name: Set up Cloud SDK
uses: 'google-github-actions/setup-gcloud@v1'
- name: Deduplicate page views
run: bq query --use_legacy_sql=False --flagfile deduplicate_page_views.sql
- name: Deduplicate events
run: bq query --use_legacy_sql=False --flagfile deduplicate_events.sql
Conclusion
That's it! Now we have a fully automated pipeline that extracts data from Simple Analytics, loads it into BigQuery, and deduplicates it on a regular basis. We can now use this data to build dashboards, reports, and more. To see a live repository employing this workflow, check out the cloudquery/recipes repository (opens in a new tab).