The single most important idea in modern data tooling: edit locally, execute remotely. Internalize it and every other choice — extension, language server, inner loop, debugger — falls out naturally. Ignore it and you end up with a slow laptop running small samples of big data, or a browser tab for every compute system, or both.
What lives where
The physics are simple. Two categories of work, two places they belong:
| Work | Where it belongs | Why |
|---|---|---|
| Editing, linting, formatting, navigation, autocomplete | Laptop | Sub-second feedback. Round-trip latency to a remote machine is a productivity killer. |
| Git operations, PR review, diff viewing, commit authoring | Laptop | Git is local by design; round-tripping to a cloud IDE for git commit is absurd. |
| Type checking via Pylance, Ruff formatting | Laptop | Language servers need fast filesystem access. Cloud-hosted LSPs can add 100 ms or more to every keystroke. |
| AI autocomplete (Copilot, Cursor) | Laptop for the input, cloud for the model | The completion UI runs on your screen; the model runs where the provider hosts it. |
| Fast unit tests (pure-Python, in-memory) | Laptop | Seconds matter. Don't send an `assert 1 + 1 == 2` to a Spark cluster. |
| Spark / PySpark jobs that touch real data | Cloud | Real data does not fit on your laptop. Databricks Connect handles the handoff. |
| dbt builds against prod-sized warehouses | Cloud | The warehouse runs the SQL. The editor compiles the Jinja and inspects results. |
| Airflow DAG scheduling in production | Cloud | Airflow on a laptop is a local development convenience, not a production deployment. |
| Model training, LLM inference, GPU workloads | Cloud | GPUs are expensive. Keep them centralized and scheduled. |
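The fast-unit-test row in the table is worth making concrete: anything pure-Python and in-memory should run as a plain test on the laptop. A minimal sketch (the function and test names are illustrative, not from a real project):

```python
# A pure-Python transformation: no Spark, no network, runs in milliseconds.
def normalize_status(raw: str) -> str:
    """Map free-form status strings to a canonical value."""
    return raw.strip().lower().replace("-", "_")

# Laptop-speed unit test: plain assertions, no cluster involved.
def test_normalize_status():
    assert normalize_status("  Active ") == "active"
    assert normalize_status("On-Hold") == "on_hold"

test_normalize_status()
```

Tests like this belong in the laptop column precisely because nothing in them touches data gravity.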
VS Code becomes the one surface where both halves meet. The laptop provides the keyboard and screen. The cloud provides the compute and data. Extensions hide the boundary so cleanly that most of the time you forget which side runs what.
Why this pattern wins
Before VS Code + remote-execution extensions, data creators faced three bad options:
- All local. Install a tiny Spark cluster on your laptop, sample the data down to something that fits in RAM, run against the sample, pray the real data behaves the same. Sampling bias hides half of your production bugs until they surface in production.
- All remote. Live in a cloud IDE (Databricks workspace UI, Snowflake Snowsight, AWS Cloud9). Lose every editor affordance you cared about: real keyboard shortcuts, a real Git client, real language servers, real AI agents. Productivity collapses to the lowest common denominator.
- Tabbed browsers. Edit in VS Code, tab to the Databricks UI to run, tab to Snowsight to query, tab to GitHub to review, tab to Slack to ask a question. Every tab switch fragments attention and invites errors.
The edit-local-execute-remote pattern dissolves the trade-off. The editor experience is as rich as a desktop IDE can make it. The compute is as real as production can get. You do not choose between them.
How the seams are hidden
Three examples of how the pattern works in practice.
Databricks Connect
The Databricks extension plus the databricks-connect Python client gives you a local SparkSession that is really a remote one:
```python
from databricks.connect import DatabricksSession

# Looks like a local SparkSession; actually drives the configured remote cluster.
spark = DatabricksSession.builder.getOrCreate()

df = spark.table("main.silver.subscription_events")
df.filter(df.status == "active").count()
```
The `DatabricksSession.builder` call returns an object that looks like a local Spark driver but forwards every DataFrame operation to the configured remote cluster. Pandas-style exploration runs on cluster memory. Debugger breakpoints work against the local Python code. `df.show()` streams results back and prints them in the integrated terminal.
You write what reads like local PySpark. The cluster runs it. Inspection feels local because the Python process on your laptop is what holds the driver state.
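The handoff can be illustrated with a toy proxy. This is not how databricks-connect is implemented; it only shows the shape of the pattern: a local object records operations lazily, and a stand-in "cluster" executes the recorded plan.

```python
# Toy illustration of the edit-local-execute-remote handoff.
# NOT databricks-connect internals -- just the shape of the pattern.

class RemoteFrame:
    """A local handle that records operations instead of running them."""

    def __init__(self, table, plan=None):
        self.table = table
        self.plan = plan or []

    def filter(self, predicate):
        # Lazily record the operation; nothing executes yet.
        return RemoteFrame(self.table, self.plan + [("filter", predicate)])

    def count(self, executor):
        # Only now does the plan travel to the "cluster" (here, a function).
        return executor(self.table, self.plan + [("count", None)])

def fake_cluster(table, plan):
    """Stands in for remote compute: applies the recorded plan to toy data."""
    rows = {"events": [{"status": "active"}, {"status": "churned"}]}[table]
    for op, arg in plan:
        if op == "filter":
            rows = [r for r in rows if arg(r)]
        elif op == "count":
            return len(rows)

df = RemoteFrame("events")
active = df.filter(lambda r: r["status"] == "active").count(fake_cluster)
print(active)  # 1
```

The local object holds the plan, which is why inspection and debugging feel local; only execution crosses the wire.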
dbt compile-and-inspect
dbt compilation is itself partly local and partly remote:
- The compilation (Jinja resolution, `{{ ref() }}` substitution, `{{ var() }}` injection) is pure Python metadata work. It happens on your laptop in milliseconds.
- The execution of the compiled SQL is a warehouse operation. It happens in Snowflake or Databricks SQL, against real (or dev-schema) data.
The dbt extensions (Fusion LSP or dbt Power User) show the compiled SQL inline before you save. You read the final SQL, hover to inspect CTEs, and confirm the compile output matches your mental model — all without leaving the editor.
Note
The compile step is the "local" in edit-local-execute-remote. That is what makes the inner loop fast for dbt: you catch half of your bugs in compiled SQL before any warehouse credits are spent.
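The local half of that split is easy to demonstrate: resolving `{{ ref() }}` is string-and-metadata work that needs no warehouse connection. A toy resolver in pure Python (this is not dbt's implementation; the model mapping is made up):

```python
import re

# Toy ref() resolver -- NOT dbt's implementation. It only demonstrates that
# Jinja-style substitution is local metadata work needing no warehouse.
MODEL_SCHEMA = {"subscription_events": "analytics.silver.subscription_events"}

def compile_sql(raw_sql: str) -> str:
    """Replace {{ ref('model') }} markers with fully qualified table names."""
    def resolve(match):
        return MODEL_SCHEMA[match.group(1)]
    return re.sub(r"\{\{\s*ref\('([^']+)'\)\s*\}\}", resolve, raw_sql)

raw = "select * from {{ ref('subscription_events') }} where status = 'active'"
print(compile_sql(raw))
# select * from analytics.silver.subscription_events where status = 'active'
```

Because this step is a local function of the project files, it can run on every keystroke, which is exactly what the dbt extensions exploit.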
Astro CLI
Airflow is different: the compute is light, but the scheduler is heavy. The pattern there is develop locally, deploy remotely:
- `astro dev start` launches a complete Airflow stack in Docker on your laptop: scheduler, triggerer, webserver, Postgres. Hot reload reflects DAG edits in seconds.
- Tests run against the local stack. Unit tests on callables run against local Python. Integration tests run against the local Airflow.
- Deploy pushes the validated project to a managed Airflow environment in the cloud (Astronomer, AWS MWAA, or a self-hosted deployment).
The laptop is the iteration surface. The cloud is the production surface. You never iterate on production DAGs by editing them in the web UI.
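The "unit tests on callables run against local Python" point deserves an example: a task's callable is just a function, so it can be tested without starting Airflow at all. A sketch with illustrative names (not from a real project):

```python
# An Airflow task callable is plain Python; it needs no scheduler to test.
# Function and field names here are illustrative.

def count_new_rows(records: list[dict]) -> int:
    """The function a PythonOperator would invoke inside the DAG."""
    return sum(1 for r in records if r.get("is_new"))

# Laptop-speed unit test: no Docker, no scheduler, no webserver.
def test_count_new_rows():
    records = [{"is_new": True}, {"is_new": False}, {"is_new": True}]
    assert count_new_rows(records) == 2

test_count_new_rows()
```

Keeping business logic in plain callables like this is what makes the local iteration surface fast; the Docker stack is only needed for integration-level checks.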
When the pattern breaks
The pattern has limits. Recognize them and work around them.
- Data-gravity local exploration. Exploratory analysis of a 200 MB CSV does not need the cloud; opening it locally with Polars or DuckDB is faster. Do that.
- Air-gapped environments. If your shop forbids outbound connections to Databricks from developer laptops, you cannot use Databricks Connect. Fall back to DevPod, with a compute-adjacent VM as the editor host.
- Regulated data that cannot leave the cluster. Schema exploration and output samples may be regulated. Even a `df.show()` streams rows to the laptop. If that is disallowed, the editor must run on the cluster side (Remote-SSH or a cloud-hosted devpod), not the laptop side.
- Notebooks that want state to live on the cluster. The `%%sql` and `%run` magics in the Databricks workspace do not translate one-to-one to local files. When you need that experience, the workspace UI is the right surface.
These are exceptions. The default is still edit-local-execute-remote for the 95% of work that tolerates it.
Implications for the toolchain
Once the pattern is settled, the rest of the toolchain arranges itself:
- Language servers run locally. Pylance, the dbt LSP, and SQLFluff all parse code on the laptop for sub-second feedback.
- AI agents read local code. Copilot, Claude Code, and Cursor index and read the repo from the laptop. Permissions stay local.
- MCP servers bridge the gap. When the agent needs to query the remote warehouse, the MCP server (Databricks, dbt Power User) runs the query and returns the result. The agent does not move to the cloud; the compute comes to the agent.
- CI is cloud-only. CI builds against real warehouses, real clusters, real managed Airflow. The laptop's job is to get a change to the point where CI will accept it.
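The "language servers run locally" bullet usually cashes out as editor configuration checked into the repo. A minimal `pyproject.toml` fragment for Ruff, as one sketch (the values are illustrative, not prescriptive):

```toml
# Illustrative Ruff configuration; tune to your repo's conventions.
[tool.ruff]
line-length = 100

[tool.ruff.lint]
# E (pycodestyle) and F (pyflakes) rules; both run locally in milliseconds.
select = ["E", "F"]
```

Because the config travels with the repo, every laptop and CI runner applies the same local checks.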
The discipline
The pattern works only if you stay in the editor. The moment you tab to a browser to "just quickly" check something in the Databricks UI, the loop breaks. Every tool and convention in the rest of this documentation exists to make sure you do not have to leave.
See also
- VS Code quickstart — concrete steps to wire this pattern up.
- Inner loop — what fast feedback looks like per stack.
- Debugging — how the seam stays invisible during step-through debugging.
- MCP servers — how agents participate in the pattern without leaving the editor.