SQL Standards

This document outlines the SQL standards and best practices for the Causeway data platform. Following these standards ensures consistency, maintainability, and optimal performance across all database interactions.

Consistent SQL standards are critical for maintaining a well-functioning data ecosystem. Whether you're creating data models, writing queries, or developing data pipelines, these guidelines will help ensure your SQL code is readable, performant, and follows best practices.

Naming Conventions

Database ObjectConventionExample
TablesSnake case, plural nounscustomer_orders, product_categories
ColumnsSnake case, singular descriptive namesfirst_name, created_at
Primary KeysTable name (singular) + _idcustomer_id, order_id
Foreign KeysReferenced table (singular) + _idcustomer_id in orders table
ViewsPrefix vw_ + descriptive namevw_customer_orders_summary
FunctionsPrefix fn_ + verb + nounfn_calculate_order_total
Stored ProceduresPrefix sp_ + verb + nounsp_update_inventory
IndexesPrefix idx_ + table + columnsidx_customers_last_name
ConstraintsPrefix based on type (pk_, fk_, uq_, ck_)pk_customers, fk_orders_customers
Temporary TablesPrefix tmp_ or # (for SQL Server)tmp_order_processing, #order_processing

SQL Style Guidelines

Capitalization

Indentation and Formatting

-- Good formatting example
SELECT
    c.customer_id,
    c.first_name,
    c.last_name,
    o.order_id,
    o.order_date,
    o.total_amount
FROM
    customers c
JOIN
    orders o ON c.customer_id = o.customer_id
WHERE
    o.order_date >= '2023-01-01'
    AND o.total_amount > 100
ORDER BY
    o.order_date DESC
LIMIT 100;

Comments

/*
 * Query to find high-value customers who made purchases in the last 30 days
 * Used for the monthly marketing campaign
 * Created by: Data Team
 * Last updated: 2023-05-15
 */
SELECT
    c.customer_id,
    c.email,
    SUM(o.total_amount) AS total_spent
FROM
    customers c
JOIN
    orders o ON c.customer_id = o.customer_id
WHERE
    o.order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY
    c.customer_id,
    c.email
HAVING
    SUM(o.total_amount) > 1000  -- High-value threshold defined by marketing
ORDER BY
    total_spent DESC;

Query Best Practices

Select Only Required Columns

Avoid using SELECT * in production code. Explicitly specify the columns you need to improve performance and make dependencies clear.

Instead of:

SELECT * FROM customers;

Use:

SELECT
    customer_id,
    first_name,
    last_name,
    email
FROM
    customers;

Use Appropriate JOINs

Always use the appropriate JOIN type (INNER, LEFT, RIGHT, FULL) based on your requirements. Be explicit with JOIN conditions.

-- Example of explicit JOIN
SELECT
    c.customer_id,
    c.first_name,
    o.order_id
FROM
    customers c
LEFT JOIN  -- Using LEFT JOIN because we want all customers even without orders
    orders o ON c.customer_id = o.customer_id;

Use Table Aliases

Use meaningful table aliases, especially when working with multiple tables. Use single-letter aliases only for simple queries.

Instead of:

SELECT
    a.id,
    a.name,
    b.id,
    b.date
FROM
    customers a
JOIN
    orders b ON a.id = b.customer_id;

Use:

SELECT
    cust.customer_id,
    cust.customer_name,
    ord.order_id,
    ord.order_date
FROM
    customers cust
JOIN
    orders ord ON cust.customer_id = ord.customer_id;

Optimize WHERE Clauses

Place the most restrictive conditions first. Avoid functions on indexed columns in WHERE clauses as they can prevent index usage.

Instead of:

SELECT
    *
FROM
    orders
WHERE
    EXTRACT(YEAR FROM order_date) = 2023;

Use:

SELECT
    *
FROM
    orders
WHERE
    order_date >= '2023-01-01'
    AND order_date < '2024-01-01';

Use CTEs for Complex Queries

Use Common Table Expressions (CTEs) for complex queries to improve readability and maintainability.

WITH monthly_sales AS (
    SELECT
        EXTRACT(MONTH FROM order_date) AS month,
        SUM(total_amount) AS revenue
    FROM
        orders
    WHERE
        order_date >= '2023-01-01'
    GROUP BY
        EXTRACT(MONTH FROM order_date)
)
SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
    revenue - LAG(revenue) OVER (ORDER BY month) AS revenue_growth
FROM
    monthly_sales
ORDER BY
    month;

Data Types

Use appropriate data types for the data being stored. Here are the recommended data types for common scenarios:

Type of DataRecommended Data TypeNotes
Primary KeysINTEGER or BIGINTUse with AUTO_INCREMENT or IDENTITY
Small TextVARCHAR(n)For variable-length text with a reasonable maximum size
Large TextTEXTFor large text content without a defined limit
Integer NumbersINTEGER or BIGINTChoose based on expected range of values
Decimal NumbersDECIMAL(p,s)Use for exact numeric values like money
Floating PointFLOAT or DOUBLEUse when exact decimal precision is not required
BooleanBOOLEANUse for true/false flags
DatesDATEFor date values without time
Date and TimeTIMESTAMPWith time zone awareness when needed
TimeTIMEFor time values without a date

Common Patterns

Pagination

Use the following pattern for paginated results:

SELECT
    column1,
    column2
FROM
    table_name
ORDER BY
    sort_column
LIMIT
    page_size
OFFSET
    (page_number - 1) * page_size;

Hierarchical Data

For hierarchical data (like organizational structures), use a recursive CTE:

WITH RECURSIVE hierarchy AS (
    -- Base case: top-level categories
    SELECT
        id,
        name,
        parent_id,
        1 AS level
    FROM
        categories
    WHERE
        parent_id IS NULL

    UNION ALL

    -- Recursive case: child categories
    SELECT
        c.id,
        c.name,
        c.parent_id,
        h.level + 1
    FROM
        categories c
    JOIN
        hierarchy h ON c.parent_id = h.id
)
SELECT
    *
FROM
    hierarchy
ORDER BY
    level, name;

Pivot Tables

For dynamic pivot tables, use conditional aggregation:

SELECT
    product_id,
    SUM(CASE WHEN month = 1 THEN sales ELSE 0 END) AS jan_sales,
    SUM(CASE WHEN month = 2 THEN sales ELSE 0 END) AS feb_sales,
    SUM(CASE WHEN month = 3 THEN sales ELSE 0 END) AS mar_sales,
    SUM(sales) AS total_sales
FROM
    monthly_sales
WHERE
    year = 2023
GROUP BY
    product_id
ORDER BY
    total_sales DESC;

Anti-Patterns

SELECT *

Never use SELECT * in production code. It creates implicit dependencies, reduces query performance, and returns unnecessary data.

Instead, explicitly list required columns:

SELECT
    customer_id,
    first_name,
    last_name,
    email
FROM
    customers;

Implicit JOINs

Avoid old-style implicit joins in the WHERE clause. They are harder to read and can lead to accidental cross joins.

Avoid:

SELECT
    c.customer_id,
    o.order_id
FROM
    customers c,
    orders o
WHERE
    c.customer_id = o.customer_id;

Use explicit JOINs:

SELECT
    c.customer_id,
    o.order_id
FROM
    customers c
JOIN
    orders o ON c.customer_id = o.customer_id;

Functions on Indexed Columns

Avoid using functions on indexed columns in WHERE clauses as they prevent the database from using indexes.

Avoid:

SELECT
    *
FROM
    customers
WHERE
    LOWER(email) = 'customer@example.com';

Better approach:

SELECT
    *
FROM
    customers
WHERE
    email = 'customer@example.com';

Correlated Subqueries

Avoid correlated subqueries when possible, as they execute for each row in the outer query, leading to poor performance.

Avoid:

SELECT
    c.customer_id,
    c.first_name,
    (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id) AS order_count
FROM
    customers c;

Use JOIN and GROUP BY:

SELECT
    c.customer_id,
    c.first_name,
    COUNT(o.order_id) AS order_count
FROM
    customers c
LEFT JOIN
    orders o ON c.customer_id = o.customer_id
GROUP BY
    c.customer_id,
    c.first_name;

Examples

Basic Query

-- Basic query with proper formatting and style
SELECT
    customer_id,
    first_name,
    last_name,
    email,
    created_at
FROM
    customers
WHERE
    status = 'active'
    AND created_at >= '2023-01-01'
ORDER BY
    created_at DESC
LIMIT 100;

JOIN Query

-- JOIN query with table aliases and comments
SELECT
    c.customer_id,
    c.first_name,
    c.last_name,
    o.order_id,
    o.order_date,
    o.total_amount,
    p.payment_method
FROM
    customers c
INNER JOIN  -- INNER JOIN because we only want customers with orders
    orders o ON c.customer_id = o.customer_id
LEFT JOIN   -- LEFT JOIN because some orders might not have payments yet
    payments p ON o.order_id = p.order_id
WHERE
    o.order_date BETWEEN '2023-01-01' AND '2023-12-31'
    AND o.status = 'completed'
ORDER BY
    o.order_date DESC,
    o.total_amount DESC;

Aggregation and Grouping

-- Aggregation with grouping and having
SELECT
    c.customer_id,
    c.first_name,
    c.last_name,
    COUNT(o.order_id) AS order_count,
    SUM(o.total_amount) AS total_spent,
    AVG(o.total_amount) AS avg_order_value,
    MIN(o.order_date) AS first_order_date,
    MAX(o.order_date) AS last_order_date
FROM
    customers c
JOIN
    orders o ON c.customer_id = o.customer_id
WHERE
    o.order_date >= '2023-01-01'
    AND o.status = 'completed'
GROUP BY
    c.customer_id,
    c.first_name,
    c.last_name
HAVING
    COUNT(o.order_id) >= 3  -- Only include customers with at least 3 orders
    AND SUM(o.total_amount) > 1000  -- And who spent over $1000
ORDER BY
    total_spent DESC;

Complex Query with CTEs

-- Complex query with Common Table Expressions (CTEs)
WITH customer_orders AS (
    SELECT
        c.customer_id,
        c.first_name,
        c.last_name,
        c.email,
        COUNT(o.order_id) AS order_count,
        SUM(o.total_amount) AS total_spent
    FROM
        customers c
    JOIN
        orders o ON c.customer_id = o.customer_id
    WHERE
        o.order_date >= '2023-01-01'
        AND o.status = 'completed'
    GROUP BY
        c.customer_id,
        c.first_name,
        c.last_name,
        c.email
),
customer_segments AS (
    SELECT
        customer_id,
        first_name,
        last_name,
        email,
        order_count,
        total_spent,
        CASE
            WHEN total_spent > 5000 THEN 'Premium'
            WHEN total_spent > 1000 THEN 'Standard'
            ELSE 'Basic'
        END AS customer_segment
    FROM
        customer_orders
)
SELECT
    customer_segment,
    COUNT(*) AS customer_count,
    SUM(order_count) AS total_orders,
    SUM(total_spent) AS total_revenue,
    AVG(total_spent) AS avg_customer_value
FROM
    customer_segments
GROUP BY
    customer_segment
ORDER BY
    total_revenue DESC;

Window Functions

-- Query using window functions
SELECT
    c.customer_id,
    c.first_name,
    c.last_name,
    o.order_id,
    o.order_date,
    o.total_amount,
    ROW_NUMBER() OVER (PARTITION BY c.customer_id ORDER BY o.order_date) AS order_sequence,
    SUM(o.total_amount) OVER (PARTITION BY c.customer_id ORDER BY o.order_date) AS running_total,
    AVG(o.total_amount) OVER (PARTITION BY c.customer_id) AS customer_avg_order,
    o.total_amount - LAG(o.total_amount) OVER (PARTITION BY c.customer_id ORDER BY o.order_date) AS order_amount_change
FROM
    customers c
JOIN
    orders o ON c.customer_id = o.customer_id
WHERE
    o.order_date >= '2023-01-01'
    AND o.status = 'completed'
ORDER BY
    c.customer_id,
    o.order_date;

Note: These SQL standards may vary slightly depending on your specific database platform (PostgreSQL, MySQL, SQL Server, etc.). Always consult the specific database documentation for platform-specific best practices.