Guides, concepts, and references for creators and operators. Every doc is written to be read end to end, not scanned for a single line.
From a CSV on disk to a Silver contract in fifteen minutes.
From zero to a running job on a Causeway Databricks workspace. Twenty minutes, no notebooks required.
Your first dbt model against the Causeway lakehouse, start to finish in about twenty minutes.
Connect Power BI Desktop to a Databricks SQL warehouse, build a model against a gold table, publish a thin report. About thirty minutes.
From empty directory to a running DAG on local Astro in about fifteen minutes. Assumes little prior Airflow experience.
From a fresh laptop to a working data-creator environment: extensions installed, interpreter wired, first dbt model running, in under thirty minutes.
Schema, SLA, policy: why the three live together.
Staging, intermediate, marts — what each layer is for, why you should resist inventing a fourth, and how it maps to the medallion architecture on Databricks.
dbt-databricks supports five materializations. This is the decision framework for picking the right one per model.
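One such choice, sketched as it would look in a model file; the model name, unique key, and timestamp column are illustrative, not from any Causeway project:

```sql
-- models/marts/fct_orders.sql (illustrative)
{{
  config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id',
    on_schema_change='fail'
  )
}}

select * from {{ ref('stg_orders') }}
{% if is_incremental() %}
  -- only rows newer than what the target table already holds
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

The `on_schema_change='fail'` line is the part teams forget: it turns silent schema drift into a loud build failure.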
Three physically distinct compute offerings, one right default per workload class, and the decision framework for getting it right the first time.
Three products consolidated into one data-engineering plane. Knowing which piece does what prevents architectural flailing.
The object model that governs every table, view, volume, and function on a Causeway Databricks workspace. What lives where, who can touch what, and why lineage comes for free.
Three ways Power BI can talk to Databricks, one right default, and the traps that trip teams up on AWS deployments and DirectQuery.
Four modes, increasingly overlapping, with a clear decision tree. When Import stops being the right default.
Datasets are called semantic models now and the term matters. The canonical pattern for one model per domain, many thin reports, and how Databricks metric views close the double-definition problem.
The most common Airflow mistake is treating it as a compute engine. Internalize this one distinction and every other decision follows.
Airflow 3.0 shipped in April 2025 and reshaped the model. The migration reality, the net-new capabilities, and the things that will break your old DAGs.
Three ways to say 'run B after A'. They are not equivalent. The decision framework for which to reach for when.
The organizing pattern behind a modern VS Code data-creator setup: the editor stays on the laptop, the compute stays in the cloud, and the extensions wire the two together invisibly.
Copilot, Claude Code, Cursor, Continue, Cline, Amazon Q: what each is good at, how they coexist, and how to avoid paying for four of them.
How extensions extend, where they live, how to pin and audit them, and why the marketplace is a supply-chain surface that deserves governance.
Walk a Silver dataset through review and into Gold.
Moving a model from full refresh to incremental without breaking downstream consumers. Step-by-step with Databricks-specific choices.
Running only modified models and their downstream, deferring everything else to prod. The pattern that keeps CI under five minutes as the project grows.
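The core of the pattern fits in one command; the artifact path is illustrative, and fetching prod's `manifest.json` into it is left to your CI:

```shell
# Build only changed models and their downstream; resolve everything
# else against production via deferral.
dbt build --select state:modified+ --defer --state prod-artifacts/
```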
One Airflow task per dbt model, data-aware scheduling, and how to avoid the single-BashOperator trap.
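The task-per-model mapping can be derived from dbt's `manifest.json`. This toy helper (hypothetical, not a Causeway utility) extracts one upstream-edge list per model; each key becomes one Airflow task, and in practice Astronomer Cosmos automates exactly this translation:

```python
import json

def model_edges(manifest_path):
    """Map each dbt model to its upstream model dependencies.
    Sketch only: seeds, sources, and tests are deliberately skipped."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    models = {uid for uid, node in manifest["nodes"].items()
              if node["resource_type"] == "model"}
    return {uid: [dep for dep in manifest["nodes"][uid]["depends_on"]["nodes"]
                  if dep in models]
            for uid in models}
```

Wiring each entry to its own operator (rather than one BashOperator running `dbt build`) is what gives you per-model retries and per-model visibility.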
YAML in, deployed resources out. The only sanctioned way to ship anything to a Causeway Databricks workspace.
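A minimal bundle sketch; every name, host, and path below is a placeholder:

```yaml
# databricks.yml (illustrative)
bundle:
  name: causeway_example

targets:
  dev:
    mode: development
    workspace:
      host: https://example.cloud.databricks.com

resources:
  jobs:
    nightly_silver:
      name: nightly-silver
      tasks:
        - task_key: build
          notebook_task:
            notebook_path: ../src/build_silver.ipynb
```

`databricks bundle validate` catches schema mistakes before `databricks bundle deploy -t dev` ships anything.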
The first five minutes after a dbt run fails: classify the error, identify the blast radius, and pick the right recovery path.
Managed Postgres inside Databricks. When to reach for it, the adoption ladder, and the Postgres fundamentals that a managed service does not absolve you from.
A walkthrough of building a Lakeflow Declarative Pipeline (formerly Delta Live Tables) from source to silver, with expectations, CDC, and the modes that trip teams up.
Cluster will not start, cluster terminates unexpectedly, cluster runs but queries hang. The 5-minute triage and the next 20.
DirectQuery performance is a joint responsibility between your Power BI model and your warehouse. The full tuning checklist.
Stop using the UI scheduler. The Enhanced Refresh REST API, incremental policies, and Airflow integration that prevent stale dashboards after a failed upstream build.
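A sketch of what calling the Enhanced Refresh endpoint looks like; the helper name and IDs are illustrative, and token acquisition and status polling are omitted:

```python
def enhanced_refresh_request(workspace_id, dataset_id, token):
    """Build (not send) an Enhanced Refresh call for a semantic model.
    Passing a JSON body is what upgrades the plain refresh endpoint to an
    enhanced refresh, which returns a pollable refresh operation."""
    url = (f"https://api.powerbi.com/v1.0/myorg/groups/{workspace_id}"
           f"/datasets/{dataset_id}/refreshes")
    return {
        "url": url,
        "headers": {"Authorization": f"Bearer {token}",
                    "Content-Type": "application/json"},
        "json": {"type": "full",
                 "commitMode": "transactional",
                 "retryCount": 2},
    }
```

From an Airflow task, send this with your HTTP client of choice only after the upstream build has succeeded; that ordering is what prevents the stale-dashboard failure mode.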
Stop committing .pbix. The folder format and TMDL files that finally make Power BI diff-friendly, review-friendly, and merge-friendly.
Deploy PBIP artifacts through dev, staging, and prod workspaces with a service principal, BPA gates, and post-deploy refresh tests.
The idempotency, atomicity, and DAG-shaping rules that let an Airflow pipeline survive contact with a backfill. Walks through a realistic hourly ingest DAG.
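The delete-then-insert idempotency rule fits in a few lines; `store` is a stand-in for a real table and `ts` for its event-time column (both assumptions of this sketch):

```python
def write_partition(store, interval_start, interval_end, rows):
    """Idempotent write keyed on the data interval: rerunning the same
    interval (retry or backfill) leaves the same rows, never duplicates."""
    # First remove anything previously written for this interval...
    store[:] = [r for r in store
                if not (interval_start <= r["ts"] < interval_end)]
    # ...then insert the fresh load, clipped to the interval. Doing the
    # delete first is what makes a mid-task crash safe to retry.
    store.extend(r for r in rows
                 if interval_start <= r["ts"] < interval_end)
```

Bound both steps by the task's data interval and a backfill becomes just many independent reruns.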
The hard part of orchestration is what happens when things go wrong. Retries are for transient failures, pools are for rate-limited resources, sensors are for not-ready-yet. Get the distinction right or you build flakiness.
The on-call procedure for a failed task: classify, check blast radius, pick a recovery path. Five minutes to the first decision.
Stop polling. Trigger DAGs when the data arrives, whether the producer is another DAG or an external system.
Reproducible dev environments, shipped in the repo, identical for every engineer. The fix for 'works on my machine' that actually holds.
What fast feedback actually looks like per stack — the 5-to-30-second write-run-inspect-fix cycle that divides productive data creators from slow ones.
Launch configs per workload, conditional breakpoints, data breakpoints, query-profile reading. The single highest-leverage skill for data engineers.
Give Claude Code, Cursor, or Copilot real tools: Databricks managed MCP, dbt Power User embedded MCP, GitHub, and the scoping rules that keep agents safe.
The subset of dbt commands you will use daily, with selector syntax and the flags that actually matter.
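A taste of the selector syntax; model names and tags are illustrative:

```shell
dbt build --select my_model        # run + test one model
dbt build --select +my_model       # ...plus everything upstream
dbt build --select my_model+       # ...plus everything downstream
dbt run --select tag:nightly --full-refresh   # rebuild incrementals from scratch
```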
The Databricks CLI subcommands you actually use day-to-day, in the order you hit them.
Three production tools, distinct roles, and the commands you actually use day to day.
The subset of Airflow and Astro CLI commands you use daily, plus the REST API fallback when the CLI does not cover something.
The `code` CLI, the command palette, the shortcuts that save seconds per iteration, and the Tasks that save minutes per day.
The Causeway rules for how a dbt model should look. Naming, structure, materialization, testing, and contracts. Enforced at review time.
What a Causeway dbt project must satisfy before the first prod deploy, and what it must keep satisfying after. Enforced at the Gold-promotion gate.
Causeway's rules for how a Databricks workload is structured, named, and deployed. Enforced at review time; exceptions require an RFD.
What a Causeway Databricks workload must satisfy before its first prod deploy, and what it must keep satisfying. Enforced at the promote-to-prod gate.
Python coding standards and best practices for the Causeway data platform.
The Causeway rules for a Power BI semantic model: shape, connectivity, DAX, storage mode, versioning. Enforced in PR review via BPA.
SQL coding standards and best practices for the Causeway data platform.
What a shared semantic model must satisfy before its first prod deploy, and keep satisfying after. Enforced at the Certified-endorsement gate.
The Causeway rules for a DAG: shape, dependencies, retries, resource gating, deployment. Enforced at review time.
What an Airflow deployment must satisfy before the first production DAG runs, and keep satisfying after. Enforced at the promote-to-prod gate.
The Causeway rules for a data-creator repo's `.vscode/`, `.devcontainer/`, and agent configuration. Enforced in PR review.
What the paved-path VS Code environment must satisfy before a team calls it 'done,' and keep satisfying after. The operational gate platform owns.
Reference documentation for the Causeway platform data model, including core entities and relationships.
The full config matrix for every dbt-databricks materialization, including incremental strategies, clustering, compute routing, and schema-change handling.
Symptom-first lookup for the errors you hit weekly: compilation, database, incremental schema drift, permissions, package conflicts.
Detailed reference for configuring the Causeway platform, including configuration files and environment variables.
Exhaustive config options for Databricks SQL warehouses: sizing, scaling, auto-stop, Photon, type selection.
Symptom-first lookup for the errors you hit most on a Databricks workspace: cluster launches, UC permissions, Delta writes, init scripts, Lakeflow pipelines.
The handful of DAX rules that cover 80% of production performance and correctness issues, plus the patterns to avoid.
The six concurrency knobs, how they interact, and how to tune for the bottleneck actually limiting you.
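The configuration-file knobs among them look like this in `airflow.cfg`; the values are illustrative, not recommendations:

```ini
[core]
parallelism = 32                # ceiling on running tasks across the deployment
max_active_tasks_per_dag = 16   # default per-DAG ceiling
max_active_runs_per_dag = 1     # serialize runs of the same DAG

[celery]
worker_concurrency = 16         # task slots per worker
```

The remaining knobs live elsewhere: pools are defined as resources and referenced per task, and `max_active_tis_per_dag` is set on the task itself.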
Symptom-first lookup for the errors on-call hits weekly. Task failures, scheduler issues, resource exhaustion.
Every extension ID the paved path ships, what it does, what settings it introduces, and the known sharp edges.
The settings.json, launch.json, and tasks.json recipes every data-creator workspace ships, plus the anti-patterns to avoid.