dbt model authoring standards

Standards dbt

These standards bind every dbt model in the Causeway monorepo. They are not recommendations. If a model needs to deviate, deviate in the model's yml file with a comment explaining why and a link to the RFD or decision.

1. Project layout

Three layers, no fourth.

models/
  staging/
  intermediate/
  marts/

Staging (stg_<source>__<table>.sql): one-to-one with a source table. Renames, casts, coerces nulls. No joins, no aggregations, no filters. Always a view.
Intermediate (int_<verb>__<noun>.sql): composes staging into reusable building blocks. Default ephemeral; promote to view if referenced by three or more downstream models. Never exposed to BI or external consumers.
Marts (fct_<grain>.sql, dim_<entity>.sql, agg_<metric>.sql): what downstream reads. Default table; promote to incremental when full-refresh cost hurts; materialized_view for DLT-managed aggregations; streaming_table when latency beats cost.

Warning

Do not invent a fourth layer. "metrics/", "presentation/", "semantic/" always feel clarifying the day they are introduced and always become naming arguments within a quarter. The three layers compose everything.

2. Naming

Thing	Pattern	Example
Staging	`stg_<source>__<table>`	`stg_salesforce__accounts`
Intermediate	`int_<verb>__<noun>`	`int_joined__orders_customers`
Fact mart	`fct_<grain>`	`fct_orders`
Dimension mart	`dim_<entity>`	`dim_customers`
Aggregate mart	`agg_<metric>`	`agg_revenue_daily`
Test file	`test_<assertion>.sql`	`test_no_orphan_orders.sql`
Column	`snake_case`, no abbrev. except `id`	`customer_id`, `placed_at`, `revenue_usd`
Boolean	`is_` / `has_` prefix	`is_active`, `has_subscription`
Timestamp	`_at` suffix	`created_at`, `updated_at`
Date	`_date` suffix	`event_date`, `placed_date`
Ingestion timestamp	`_loaded_at`	reserved; never use for anything else

Currency columns carry the currency in the name: amount_usd, spend_eur.

3. File structure inside a model

Every non-trivial model follows this shape:

with <input1> as ( select * from {{ ref('…') }} ),
     <input2> as ( select * from {{ ref('…') }} ),

     <transform_step_1> as (
         ...
     ),

     <transform_step_2> as (
         ...
     ),

     final as (
         ...
     )

select * from final

Rules:

Read every ref() into a named CTE at the top, before any logic. Never subquery a ref(). Keeps compiled SQL readable in Databricks query history.
One CTE per logical step. Name the step, not the output table: renamed, filtered, deduplicated, joined, final.
Always end with select * from final. The pattern is a promise about shape; it also makes the final column list debuggable with a trailing -- select * from joined limit 100 during development.

4. Jinja discipline

Macros are for repetition, not cleverness. If a macro is used once, inline it. If it is used twice, wait for a third. Intermediate models observe as nodes in the DAG; macros do not.
Reach for packages before writing macros. dbt_utils for pivots, surrogate keys, date spines. dbt_expectations for expressive tests. dbt-labs/codegen for boilerplate.
Env gating via target.name. Differences between dev and prod go in {% if target.name == 'prod' %} blocks, not environment variables read in SQL.
No per-developer vars. Dev scoping happens via generate_schema_name. If two developers need different values at build time, they need different models.

5. Materialization defaults

Baked into dbt_project.yml:

models:
  causeway:
    staging:
      +materialized: view
      +schema: staging
      +tags: [staging]
    intermediate:
      +materialized: ephemeral
      +tags: [intermediate]
    marts:
      +materialized: table
      +schema: gold
      +tags: [marts]

Override per-model only when the default is wrong. Justify the deviation in a comment next to config().

Note

"Justify" means one line pointing at a measurable reason: "full refresh takes 42m, incremental needed" or "consumed by four dashboards that poll every minute, needs to be a table". Opinions about "feels right" do not count.

6. Incremental models

If you write materialized='incremental', you commit to all of these:

A real, reliable watermark column.
A deduplication CTE before any merge.
is_incremental() filter anchored to max(watermark) from {{ this }}, never to wall-clock time.
on_schema_change set explicitly. append_new_columns is the Causeway default; ignore is banned without a reviewed RFD.
A unique test on the unique_key.
A not_null test on the watermark column.

See Build your first incremental model for the worked example.

Danger

on_schema_change: 'ignore' silently drops columns that disappear upstream. It is a data-loss path that hides until someone asks why a column is empty. Never set it. If you have a reason, write an RFD first.

7. Physical layer

Liquid clustering by default for every table and incremental. Pick clustering keys from the columns consumers filter and join on, typically the primary entity key and a time column.
partition_by only for hard isolation requirements (GDPR region separation, multi-tenant where tenants must never co-locate).
Compute routing via named compute in profiles.yml. Serverless SQL warehouses for everything SQL. All-purpose clusters only for Python models.

{{ config(
    materialized='table',
    liquid_clustered_by=['customer_id', 'event_date'],
    tblproperties={'delta.autoOptimize.optimizeWrite': 'true'}
) }}

8. Tests

Tests fall into four categories. Do not conflate them.

Category	Where declared	Runs against	Catches
Generic data tests	`schema.yml`	Real data	Data quality issues
Singular tests	`.sql` in `tests/`	Real data	One-off business rules
Unit tests	`schema.yml`	Static mocks	Logic regressions
Model contracts	`schema.yml`	Schema metadata	Breaking changes to public interfaces

Required per model:

Model type	Required tests
Staging	`not_null` + `unique` on the primary key; `not_null` on the ingestion timestamp
Intermediate	`not_null` on any required join keys
Fact mart	`not_null` + `unique` on the grain key; `not_null` on all required dims; `relationships` to referenced dims
Dim mart	`not_null` + `unique` on the entity key; contract enforced
Incremental model	All of the above plus `not_null` on the watermark

Run tests with dbt build, not dbt run + dbt test split. build interleaves so a failing test blocks downstream, which is almost always what you want.

9. Model contracts

Contracts are for public interfaces: marts that BI, other dbt projects, or services consume. Contracts are not for internal intermediates.

models:
  - name: dim_customers
    config:
      contract:
        enforced: true
    columns:
      - name: customer_id
        data_type: bigint
        constraints:
          - type: not_null
          - type: primary_key
      - name: email
        data_type: string
        constraints: [{type: not_null}]

When a contract breaks, create dim_customers_v2 alongside and version with versions: rather than breaking consumers in place. Consumers migrate on their own cadence.

10. Sources and freshness

Every external table enters the project through a declared source. Every source has freshness thresholds.

sources:
  - name: bronze
    database: prod
    schema: bronze
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    loaded_at_field: _loaded_at
    tables:
      - name: raw_orders

Source freshness checks run out-of-band via dbt source freshness, wired into orchestration (Airflow DAG, Cosmos task group), not into CI. A stale source should page someone, not fail a PR.

11. Packages

Prefer community packages over writing macros. Vendor-controlled and version-pinned in packages.yml.

Acceptable packages in Causeway projects:

dbt-labs/dbt_utils — date spines, surrogate keys, pivots.
calogica/dbt_expectations — statistical and distributional tests.
elementary-data/elementary — anomaly detection and test-run history.
dbt-labs/codegen — generate schema.yml and source declarations.

Any package not on that list requires review in a PR with explicit justification.

12. Review checklist

PRs touching models/ must satisfy:

[ ] Naming conforms to the table in section 2.
[ ] File structure follows the CTE pattern in section 3.
[ ] No macro defined that is used fewer than twice.
[ ] Materialization matches layer defaults, or deviation is justified.
[ ] Liquid clustering on any new table / incremental; no partition_by without an isolation justification.
[ ] Required tests present per section 8.
[ ] Public marts have contract: enforced: true.
[ ] Sources declared with freshness thresholds.
[ ] No new vars: for per-developer values.
[ ] Slim CI build passes against the stored prod manifest.

1. Project layout

2. Naming

3. File structure inside a model

4. Jinja discipline

5. Materialization defaults

6. Incremental models

7. Physical layer

8. Tests

9. Model contracts

10. Sources and freshness

11. Packages

12. Review checklist

See also