A materialization is how dbt physically persists your model. dbt-databricks supports five: `view`, `table`, `incremental`, `materialized_view`, and `streaming_table`. dbt's `ephemeral` is not a true materialization, but it competes for the same decision, so it appears below as a sixth option. Each has a sweet spot; mismatches waste money and confuse debugging.
The six options
| Materialization | What it is | Cost profile |
|---|---|---|
| `view` | A SQL view; computed on read | Zero storage; query-time cost |
| `table` | A full table rebuild every run | Write cost every run; cheap reads |
| `incremental` | Merge/append/overwrite of new rows only | Low write per run after first; complex |
| `materialized_view` | Databricks-managed MV (delegated to DLT) | Databricks decides refresh cadence |
| `streaming_table` | Reads a Delta change feed or stream | Continuous; latency-over-cost |
| `ephemeral` | Not a materialization; a CTE inlined into callers | No storage; inflates downstream SQL |
Decision framework
Walk the questions in order. Stop at the first "yes".
- Is the model 1:1 with a source table and only renames/casts? → `view` (staging default).
- Is the model a small reusable building block not exposed downstream? → `ephemeral` or `view` (intermediate default).
- Does the full-refresh cost hurt and the source has a reliable watermark? → `incremental`.
- Is the model an aggregation of a large source that Databricks could maintain for you? → `materialized_view`.
- Does latency matter more than cost, and is there a streaming source? → `streaming_table`.
- Otherwise → `table`.
Defaults per layer
```yaml
# dbt_project.yml
models:
  my_project:
    staging:
      +materialized: view
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
```
Override per-model with a `config()` block when the rule of thumb does not fit.
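For instance, a heavy intermediate that the layer default would make ephemeral can be pinned to a table in its own file. A minimal sketch, with model and column names that are illustrative, not from any real project:

```sql
-- models/intermediate/int_daily_spend.sql (hypothetical model)
{{ config(materialized='table') }}

select
    customer_id,
    order_date,
    sum(amount) as daily_spend
from {{ ref('stg_orders') }}
group by customer_id, order_date
```

The in-file `config()` wins over the `dbt_project.yml` default for that one model only.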
view — the default you should reach for
Views cost nothing to maintain. They always reflect the latest source data. They are the right default for staging and most intermediates.
Views stop being right when:
- The SQL is expensive and gets run many times per day (push to `table`).
- The source is enormous and every query scans the whole thing (push to `incremental` if there is a watermark, or to `materialized_view` if it is an aggregation).
table — when writes are cheaper than the reads would be
A table materialization does `CREATE OR REPLACE TABLE ... AS ...` every run. You pay the full rewrite cost on the schedule, in exchange for cheap reads the rest of the time.
`table` is the right mart default when:
- The model fits the warehouse in a reasonable time (say, under ten minutes).
- Downstream reads are more frequent than writes.
Move to `incremental` when the full-refresh cost becomes painful.
incremental — the one that earns its complexity
Incremental models merge, append, or overwrite only the new data. They are the single largest source of pain in a production dbt project. They are also the single largest source of cost savings when the source is big.
Read the dedicated incremental model guide before building your first one. The failure modes (drift, schema change, unique-key violations) are shape-specific.
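For orientation only, a merge-style incremental with a watermark filter tends to look like the sketch below. All model and column names are assumptions for illustration:

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='event_id'
) }}

select
    event_id,
    user_id,
    event_type,
    event_ts
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- on incremental runs, pull only rows past the current high-water mark
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}
```

The `is_incremental()` branch is what makes the second and later runs cheap; the first run (and any `--full-refresh`) executes the query without the filter.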
materialized_view — hand the refresh logic to Databricks
materialized_view delegates to Delta Live Tables under the hood. Databricks owns the incrementalization: no `is_incremental()`, no `unique_key` debate, no `on_schema_change` config. You write the aggregation query; DLT figures out what to change.
The trade-off:
- You get less control over when and how the refresh happens.
- Some operations (complex joins, certain window functions) are not supported inside an MV.
Good default for aggregation-heavy gold tables where "eventually consistent, roughly hourly" is fine.
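A sketch of the shape, with illustrative names: the model is just the aggregation query, with no incremental logic of its own.

```sql
{{ config(materialized='materialized_view') }}

-- DLT decides how and when to keep this aggregate fresh
select
    order_date,
    count(*)    as order_count,
    sum(amount) as gross_revenue
from {{ ref('stg_orders') }}
group by order_date
```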
streaming_table — when latency beats cost
`streaming_table` reads from a Delta change feed or a streaming source (Kafka, Kinesis). Use it when:
- Consumers need data within seconds, not hours.
- The source produces a change feed you can subscribe to.
- You are willing to pay for continuous compute.
Not appropriate for dashboards. Very appropriate for operational surfaces (fraud detection, live product counts).
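As a rough sketch, assuming a Delta source read with the `STREAM` keyword (the source name is illustrative; check the dbt-databricks docs for the exact syntax your runtime supports):

```sql
{{ config(materialized='streaming_table') }}

select
    txn_id,
    amount,
    txn_ts
from stream({{ ref('raw_transactions') }})
```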
ephemeral — useful, sometimes
An ephemeral model is not materialized at all. dbt inlines it into the callers' compiled SQL as a CTE.
Pros:
- No storage cost.
- Keeps intermediate logic DRY.
Cons:
- Does not appear as its own step in the Databricks query history. If the compiled SQL is slow, you cannot isolate which ephemeral is to blame.
- Inflates compiled SQL length when referenced by many models.
Rule of thumb: `ephemeral` for intermediates used by one or two downstream models, `view` when used by more.
Picking a strategy for incremental
Five strategies, all Delta-only unless noted:
| Strategy | Use when |
|---|---|
| `merge` (default) | Updates happen to existing rows |
| `append` | Events are immutable (pure inserts) |
| `insert_overwrite` | You rewrite whole partitions, like "today" |
| `replace_where` | You rewrite a predicate-scoped slice |
| `microbatch` (1.9+) | Late-arriving events in a time-series fact |
Start with `merge` if you have updates, `append` if you do not. Move to `insert_overwrite` or `replace_where` if the merge cost hurts and the rewrite boundary is clean.
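For example, an insert_overwrite model that rewrites only the current day's partition might be sketched as follows (the names and the date boundary are assumptions):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by=['event_date']
) }}

select
    event_id,
    event_date,
    payload
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- replace only the partitions this query touches (today's, given the filter)
  where event_date >= current_date()
{% endif %}
```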
Physical layer: clustering
On Databricks, every table and incremental should use liquid clustering unless you have a specific reason to partition.
```sql
{{ config(
    materialized='table',
    liquid_clustered_by=['customer_id', 'event_date']
) }}
```
Liquid clustering adapts as query patterns shift and eliminates the "wrong partition key" mistake. Fall back to `partition_by` only when you need hard isolation (GDPR region boundaries, for instance) or you are on a runtime too old for clustering.
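When you do need the fallback, the config swaps one key for the other; a sketch with an illustrative column name:

```sql
{{ config(
    materialized='table',
    partition_by=['gdpr_region']
) }}
```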
See also
- Incremental models guide — building your first one.
- Materializations reference — exhaustive config options per materialization.
- Model authoring standards — Causeway's rules for picking materializations.