Data engineers who debug well move measurably faster than those who do not. VS Code supports several debugging patterns relevant to data work, and every data creator should be fluent in at least two of them. This guide walks through each, with real launch.json and settings.json examples.
The default stance: always debuggable
Configure every workload to be debuggable before you ship it. Debugging after the fact — "add a print, run again, add another print, run again" — is the loop you are trying to escape. A breakpoint is worth twenty print statements.
Important
If you cannot set a breakpoint on a line and have it hit, your launch config is wrong. Fix the launch config before you debug the bug. Ten minutes spent on a good launch config pays back the first time you hit a real bug.
Launch configs live in the repo
Create .vscode/launch.json. Commit it. This is shared project infrastructure, not personal preference:
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Python: Current File",
      "type": "debugpy",
      "request": "launch",
      "program": "${file}",
      "console": "integratedTerminal",
      "justMyCode": true
    }
  ]
}
Press F5 (or use the Run and Debug sidebar) to launch. Breakpoints in the current file are hit. Output streams to the integrated terminal.
1. Pure Python debugging
Basic
The config above is sufficient for a standalone script. Variations:
- Run a specific module: "module": "mypackage.cli" instead of "program".
- Pass args: "args": ["--verbose", "input.csv"].
- Set env: "env": { "LOG_LEVEL": "DEBUG" }.
- Step into library code: flip justMyCode to false when you genuinely need to understand what the library is doing.
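Combined, a hedged example for debugging a CLI entry point as a module (mypackage.cli and the flags are illustrative names, not a real project):

```json
{
  "name": "Python: mypackage CLI",
  "type": "debugpy",
  "request": "launch",
  "module": "mypackage.cli",
  "args": ["--verbose", "input.csv"],
  "console": "integratedTerminal",
  "env": { "LOG_LEVEL": "DEBUG" },
  "justMyCode": false
}
```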
Pytest under the debugger
{
  "name": "Pytest: Current File",
  "type": "debugpy",
  "request": "launch",
  "module": "pytest",
  "args": ["${file}", "-s", "-v"],
  "console": "integratedTerminal",
  "justMyCode": true
}
Set a breakpoint inside the test or the code under test. Run with F5. Breakpoint hits. Variables pane shows everything in scope.
Tip
The -s flag disables pytest's stdout capture so your print() statements render in real time. You want this during debugging, not during CI.
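To debug a single test instead of the whole file, a hedged variant using pytest's -k filter (the test name here is illustrative):

```json
{
  "name": "Pytest: One Test",
  "type": "debugpy",
  "request": "launch",
  "module": "pytest",
  "args": ["${file}", "-k", "test_extract_events", "-s", "-v"],
  "console": "integratedTerminal",
  "justMyCode": true
}
```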
Conditional breakpoints
Right-click a breakpoint in the gutter → Edit Breakpoint. Three options:
- Expression: breaks only when the condition is true. Useful in loops, e.g. customer_id == "cust_abc".
- Hit count: breaks on the Nth pass; >= 10 breaks on iteration 10.
- Log message: does not break, just logs. Replaces print() without editing the code.
Logpoints in particular are underused. Set a logpoint like customer_id={customer_id} status={status} and get a print-style trace without touching the source.
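A sketch of where these land in practice, with illustrative names: the marked lines are where you would set the logpoint or conditional breakpoint instead of editing the code.

```python
def count_churned(customers: list[dict]) -> int:
    """Toy loop; the field names and statuses are illustrative."""
    churned = 0
    for c in customers:
        customer_id, status = c["id"], c["status"]
        # Logpoint on the next line: customer_id={customer_id} status={status}
        # Conditional breakpoint: customer_id == "cust_abc"
        if status == "churned":
            churned += 1
    return churned
```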
2. PySpark via Databricks Connect
The Databricks extension provides a "Debug current file with Databricks Connect" entry point. Under the hood it is a debugpy launch with the right env vars. You can also author the config yourself:
{
  "name": "Python: Debug with Databricks Connect",
  "type": "debugpy",
  "request": "launch",
  "program": "${file}",
  "console": "integratedTerminal",
  "justMyCode": false,
  "env": {
    "SPARK_LOCAL_IP": "127.0.0.1",
    "PYSPARK_PYTHON": "${command:python.interpreterPath}"
  }
}
Set breakpoints in:
- Local code (before and after Spark calls). Always hits.
- User-defined functions (UDFs). Hits only when the UDF runs in Python mode; JVM-side UDFs do not hit the debugger.
Inspect DataFrames in the Variables pane. The pane displays schema, partition count, and a sampled preview.
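One hedged way to keep a UDF body debuggable: factor it into a pure Python function you can breakpoint and unit-test directly, then register it. classify_event and the status values are illustrative names, not a real API.

```python
def classify_event(status: str) -> str:
    # Set a breakpoint on the next line to inspect each row's status
    # whenever the function runs under the Python debugger.
    if status in ("churned", "suspended"):
        return "inactive"
    return "active"

# Wiring into Spark under a live Databricks Connect session (not run here):
# from pyspark.sql.functions import udf
# df = df.withColumn("bucket", udf(classify_event)(df.status))
```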
What does not work
- Breakpoints inside JVM code. Spark itself runs on the JVM; Python-side debugpy cannot see into it. Read the Spark UI's Query Profile for JVM-side inspection.
- Breakpoints inside generated code. Spark's Catalyst compiles queries to bytecode. You cannot step through a generated doConsume method.
Warning
justMyCode=true hides library internals, which usually is what you want. justMyCode=false steps into pyspark, databricks.connect, and everything else. The first time you do this, your debugger session will feel overwhelming. Use sparingly.
3. dbt debugging
dbt does not run under a Python debugger in any useful way — dbt is mostly Jinja templating plus warehouse SQL execution. The high-leverage debugging paths are different:
Inspect compiled SQL
The dbt LSP (Fusion) and dbt Power User both render compiled SQL inline. Open the model, pick Preview compiled SQL from the command palette or side panel. The compiled output shows:
- Resolved {{ ref() }} names.
- Resolved {{ var() }} values.
- Resolved Jinja conditionals and loops.
Half of dbt bugs live in the compile step, not the query logic. Read the compiled SQL first.
Run a selector set
When a downstream test fails, isolate:
dbt build --select +fct_failed_test
Builds the failing model and every upstream model. If the upstream still passes, the bug is in fct_failed_test. If an upstream now fails, the bug is upstream.
Query Profile on the warehouse
Every warehouse has a query-profile UI: Databricks SQL's Query History, Snowflake's Query Profile, BigQuery's Execution Graph. dbt's log lines include a query-history link. Open it.
What to look for:
- Full-table scans where an index was expected.
- Skew. A stage where one task takes 10× longer than the others.
- Spill to disk. A stage that wrote shuffle data to disk instead of RAM.
- Explode-then-aggregate patterns. A CROSS JOIN that materializes a huge intermediate.
Tip
Teach every data creator to read a Databricks SQL Query Profile. It is the single highest-leverage skill for performance debugging, across every stack.
dbt-specific logpoints
{{ log(...) }} emits at compile time, not run time. Use it for debugging Jinja:
{% set customers = run_query("SELECT customer_id FROM dim_customer") %}
{{ log("customer count: " ~ customers | length, info=true) }}
The log line lands in dbt.log and (with info=true) in the console.
4. Airflow DAG debugging
Airflow has several levels of debugging, each useful in different conditions.
Task callable unit tests
The fastest level. If the task callable is a pure function, test it directly:
def test_extract_events():
    result = extract_events(logical_date="2026-04-20T00:00:00+00:00")
    assert len(result) == 24
Run under debugpy like any other pytest. Breakpoints hit. No Airflow runtime needed.
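For reference, a minimal sketch of the kind of pure callable such a test exercises. extract_events and its one-bucket-per-hour shape are assumptions for illustration, not a real project API.

```python
from datetime import datetime, timedelta

def extract_events(logical_date: str) -> list[dict]:
    """Return one synthetic event bucket per hour of the logical date."""
    start = datetime.fromisoformat(logical_date)
    return [
        {"hour": (start + timedelta(hours=h)).isoformat(), "count": 0}
        for h in range(24)
    ]
```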
astro dev run dags test
Runs an entire DAG synchronously with full stack traces:
astro dev run dags test my_dag 2026-04-20T00:00:00+00:00
Failures print the full traceback. No scheduler, no queues, no triggerer.
In-container debugging
To attach the debugger to a task running inside the local Astro stack:
import debugpy

# Listen on all interfaces so VS Code on the host can reach the container,
# then block the task until a debugger attaches.
debugpy.listen(("0.0.0.0", 5678))
debugpy.wait_for_client()
debugpy.breakpoint()
Launch config on the VS Code side:
{
  "name": "Python: Attach to Astro",
  "type": "debugpy",
  "request": "attach",
  "connect": { "host": "localhost", "port": 5678 },
  "pathMappings": [
    { "localRoot": "${workspaceFolder}/dags", "remoteRoot": "/usr/local/airflow/dags" }
  ]
}
Trigger the DAG; the task pauses on wait_for_client until VS Code attaches. This is heavyweight; use dags test first.
Warning
Never leave debugpy.wait_for_client() in a DAG that will deploy. A deployed DAG hangs forever waiting for an attach that never comes.
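One hedged way to make that failure mode impossible: gate the attach hook on an environment variable that only your local stack sets. DEBUG_ATTACH is an illustrative name, not an Astro convention.

```python
import os

# Only arm the debugger when the local environment explicitly opts in;
# a deployed DAG never sets DEBUG_ATTACH, so the branch is never taken.
if os.environ.get("DEBUG_ATTACH") == "1":
    import debugpy
    debugpy.listen(("0.0.0.0", 5678))
    debugpy.wait_for_client()
```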
5. Docker debugging
Some workloads run inside a container (dbt Python models, custom operators, sidecars). VS Code's Docker extension and Dev Containers extension let you attach to running containers:
{
  "name": "Python: Docker Attach",
  "type": "debugpy",
  "request": "attach",
  "connect": { "host": "localhost", "port": 5678 },
  "pathMappings": [
    { "localRoot": "${workspaceFolder}", "remoteRoot": "/app" }
  ]
}
The container must expose port 5678 and start with debugpy listening. Configure the container entrypoint:
CMD ["python", "-m", "debugpy", "--listen", "0.0.0.0:5678", "-m", "my_app"]
6. The debug console
While paused, the debug console is a REPL in the debuggee's context. Evaluate any expression:
> df.count()
12345
> df.columns
['customer_id', 'event_ts', 'status']
> df.filter(df.status == "churned").show(5)
This is usually faster than stepping through the code. Learn the console; it saves minutes.
7. Data breakpoints (Python's watchdog)
Python does not have native data breakpoints, but watch expressions in the Run and Debug view and the built-in breakpoint() hook come close. Add a watch on the value you care about; when it changes between steps, walk yourself back to the change site.
For Spark work: df.show(5) at checkpoints in the pipeline and inspect in the terminal. The cost of a show on a cached DataFrame is near-zero; it pays back the inspection you would otherwise do via the Spark UI.
Common failure modes
"My breakpoint is not hit"
The file under the debugger and the file being executed are different. Usually because:
- An older .pyc is cached. Delete __pycache__/ and retry.
- The test is discovered under a different root. Check pytest's rootdir.
- The devcontainer volume-mount path differs from the workspace path. Fix pathMappings.
"The debugger starts but I get a ModuleNotFoundError"
The launch config's interpreter differs from the one that has your dependencies. Check the launch config's python field or set python.defaultInterpreterPath in workspace settings.
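A hedged example pinning the interpreter in the launch config itself (the .venv path is an assumption about your project layout):

```json
{
  "name": "Python: Current File (.venv)",
  "type": "debugpy",
  "request": "launch",
  "program": "${file}",
  "console": "integratedTerminal",
  "python": "${workspaceFolder}/.venv/bin/python"
}
```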
"Step-in jumps 10 lines ahead"
justMyCode=true is skipping library code. Flip it to false when stepping through third-party behavior matters, then back to true when you return to app code.
"Breakpoint hit, but variables show <can not evaluate>"
The debugger's evaluation context is a different frame than the breakpoint. Click the correct frame in the Call Stack pane.
See also
- Inner loop — the baseline loops the launch configs accelerate.
- Settings reference — full launch.json recipes.
- Workspace standards — launch configs the paved path ships.
- Commands reference — debugger keyboard shortcuts.