Skip to content

Data Transformation

Data transformation is handled by SQLMesh and runs inside BigQuery. All transformation code lives in the transform/ directory.

Layout

transform/
├── config.yaml                          # SQLMesh gateway and model defaults
├── metadata.py                          # Incremental model utilities
├── macros/
├── audits/
│   └── assert_no_nulls.sql              # Reusable null-check audit
└── models/
    ├── staging/
    │   ├── stg_facebook_ads__performance.sql
    │   ├── stg_google_ads__performance.sql
    │   ├── stg_google_analytics__sessions.sql
    │   ├── stg_google_sheets__inventory.sql
    │   ├── stg_google_sheets__programs.sql
    │   ├── stg_google_sheets__students.sql
    │   ├── stg_paypal__transactions.sql
    │   └── stg_stripe__charges.sql
    └── marts/
        ├── mart_enrollment__ad_attribution.sql
        ├── mart_finance__revenue_by_program.sql
        └── mart_inventory__stock_levels.sql

Configuration

SQLMesh connects to BigQuery via a service account:

gateways:
  bigquery:
    connection:
      type: bigquery
      method: service-account
      project: american-beauty-institute
      keyfile: ${GOOGLE_APPLICATION_CREDENTIALS}
      location: US

model_defaults:
  dialect: bigquery
  start: '2024-01-01'
  cron: '@daily'

All models default to the BigQuery dialect, start backfilling from 2024-01-01, and run on a daily schedule.

Datasets

Models are organized into three BigQuery datasets:

Dataset Purpose
raw Landing zone. Tables written by the extract-load pipeline.
staging Type-cast and cleaned versions of raw tables.
marts Business-level aggregations answering specific questions.

Incremental Processing

Most models use INCREMENTAL_BY_TIME_RANGE. SQLMesh passes @start_date and @end_date variables to each model's WHERE clause, so only new partitions are processed on each run. The loaded_at column (added by transform/metadata.py) tracks when each row was last written.