ARCHITECTURE DEEP DIVE

Understanding the internals of pgsquash-engine's migration consolidation system.

This guide explains how pgsquash-engine analyzes, consolidates, and validates PostgreSQL migrations with parser-grade accuracy.

SYSTEM ARCHITECTURE

HIGH-LEVEL PIPELINE

1. SQL PARSING - PostgreSQL's own parser (pg_query_go) parses every statement
2. OBJECT TRACKING - Track the lifecycle of every database object (tables, indexes, functions, etc.)
3. DEPENDENCY RESOLUTION - Build a dependency graph to determine safe operation ordering
4. CONSOLIDATION - Merge related operations based on safety level
5. DOCKER VALIDATION - Verify schema equivalence using PostgreSQL in Docker

1. SQL PARSING

PostgreSQL Parser Integration

pgsquash-engine uses pg_query_go, which wraps libpg_query, a library built from PostgreSQL's own C parser source. Every statement is parsed by the same parser PostgreSQL itself uses.

Why this matters:

  • Accuracy - no hand-written parser bugs or missed syntax edge cases
  • Future-proof - new PostgreSQL syntax is supported as libpg_query is updated
  • Battle-tested - the PostgreSQL parser has decades of production use

Parsing Flow

// Simplified example (pg_query_go v5 module path assumed)
import (
    "fmt"

    pg_query "github.com/pganalyze/pg_query_go/v5"
)

func parseSQL(sql string) (*pg_query.ParseResult, error) {
    // pg_query_go wraps libpg_query (PostgreSQL's own C parser)
    tree, err := pg_query.Parse(sql)
    if err != nil {
        return nil, fmt.Errorf("parse error: %w", err)
    }

    // Extract statement information (engine-internal helper)
    for _, stmt := range tree.Stmts {
        extractStatementInfo(stmt)
    }

    return tree, nil
}

What Gets Parsed

Supported statement types (100+):

  • DDL: CREATE, ALTER, DROP, RENAME
  • Tables, indexes, constraints, sequences
  • Functions, procedures, triggers
  • Views, materialized views
  • Extensions, schemas, types
  • RLS policies, grants, roles
  • Comments, ownership changes

Parse Tree Structure

Each statement becomes a structured Abstract Syntax Tree (AST):

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE
)

Becomes:

CreateStmt
├── TableName: "users"
├── Columns:
│   ├── ColumnDef
│   │   ├── Name: "id"
│   │   ├── Type: SERIAL
│   │   └── Constraint: PRIMARY KEY
│   └── ColumnDef
│       ├── Name: "email"
│       ├── Type: VARCHAR(255)
│       └── Constraint: UNIQUE
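
In code, such a tree can be walked through pg_query_go's generated protobuf types. A minimal sketch (v5 module path assumed; an illustration, not the engine's internal code):

package main

import (
    "fmt"

    pg_query "github.com/pganalyze/pg_query_go/v5"
)

func main() {
    result, err := pg_query.Parse(
        "CREATE TABLE users (id SERIAL PRIMARY KEY, email VARCHAR(255) UNIQUE)")
    if err != nil {
        panic(err)
    }

    // A CREATE TABLE statement surfaces as a CreateStmt node.
    create := result.Stmts[0].Stmt.GetCreateStmt()
    fmt.Println("table:", create.Relation.Relname)

    // Table elements include column definitions and table constraints.
    for _, elt := range create.TableElts {
        if col := elt.GetColumnDef(); col != nil {
            fmt.Println("column:", col.Colname)
        }
    }
}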

2. OBJECT TRACKING

Lifecycle Tracking System

pgsquash-engine maintains a complete history of every database object across all migrations.

Tracked information:

  • Creation statement and timestamp
  • All modifications (ALTER statements)
  • Dependencies on other objects
  • Final state or deletion

Object Types Tracked

SCHEMA OBJECTS

  • Tables
  • Indexes (regular, unique, partial)
  • Constraints (PK, FK, CHECK, UNIQUE)
  • Sequences
  • Views & Materialized Views

CODE OBJECTS

  • Functions
  • Procedures
  • Triggers
  • Custom Types
  • Extensions

Example Tracking

Migration 001: CREATE TABLE users (id SERIAL);
Migration 005: ALTER TABLE users ADD COLUMN email VARCHAR(255);
Migration 012: ALTER TABLE users ADD COLUMN name VARCHAR(100);
Migration 023: ALTER TABLE users DROP COLUMN name;

Tracker stores:
users {
    created: migration_001,
    columns: {
        id: { added: 001, type: SERIAL, current: true },
        email: { added: 005, type: VARCHAR(255), current: true },
        name: { added: 012, dropped: 023, current: false }
    }
}
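
One plausible shape for this state, as a hypothetical Go sketch (illustrative types, not the engine's actual ones):

// Hypothetical tracker types, for illustration only.
type ColumnState struct {
    Type      string // e.g. "VARCHAR(255)"
    AddedIn   string // migration that added the column
    DroppedIn string // empty while the column still exists
    Current   bool
}

type TableState struct {
    CreatedIn string
    Columns   map[string]*ColumnState
}

// dropColumn records a DROP COLUMN; a column with both AddedIn and
// DroppedIn set never reaches the final schema, so the consolidator
// can elide its entire lifecycle.
func (t *TableState) dropColumn(name, migration string) {
    if c, ok := t.Columns[name]; ok {
        c.DroppedIn = migration
        c.Current = false
    }
}

In the example above, the name column ends up with both AddedIn and DroppedIn set, which is exactly the signal Scenario 1 below relies on.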

Consolidation Intelligence

The tracker enables smart decisions:

Scenario 1: Column added then dropped

-- Original (2 migrations)
ALTER TABLE users ADD COLUMN temp VARCHAR(50);
ALTER TABLE users DROP COLUMN temp;

-- Consolidated (0 migrations)
-- Entire lifecycle is removed (never existed in final schema)

Scenario 2: Incremental column additions

-- Original (3 migrations)
CREATE TABLE users (id SERIAL);
ALTER TABLE users ADD COLUMN email VARCHAR(255);
ALTER TABLE users ADD COLUMN name VARCHAR(100);

-- Consolidated (1 migration)
CREATE TABLE users (
    id SERIAL,
    email VARCHAR(255),
    name VARCHAR(100)
);

3. DEPENDENCY RESOLUTION

Dependency Graph

pgsquash-engine builds a directed acyclic graph (DAG) of dependencies between database objects.

Dependency types:

  • Table → Column - Table must exist before adding columns
  • Column → Index - Column must exist before indexing it
  • Table → Foreign Key - Referenced table must exist first
  • Function → Table - Function may reference tables
  • Trigger → Function - Trigger depends on trigger function
  • View → Table - View depends on underlying tables

Graph Construction

type DependencyGraph struct {
    nodes map[string]*Node
    edges map[string][]string // object -> objects that depend on it
}

type Node struct {
    objectName string
    objectType string
    statement  *Statement
    deps       []string // objects this one depends on
}

func (g *DependencyGraph) addNode(n *Node) {
    g.nodes[n.objectName] = n
    // Record an edge from each dependency to this node so the sorter
    // can release dependents as their prerequisites are emitted.
    for _, dep := range n.deps {
        g.edges[dep] = append(g.edges[dep], n.objectName)
    }
}

func buildGraph(statements []*Statement) *DependencyGraph {
    graph := &DependencyGraph{
        nodes: make(map[string]*Node),
        edges: make(map[string][]string),
    }

    for _, stmt := range statements {
        node := extractDependencies(stmt)
        graph.addNode(node)
    }

    return graph
}

Topological Sort

To ensure safe ordering, pgsquash-engine performs topological sorting:

Original migrations (wrong order):
1. CREATE INDEX idx_users_email ON users(email);
2. CREATE TABLE users (id SERIAL, email VARCHAR(255));

Dependency graph:
idx_users_email → users (index depends on table)

Topologically sorted:
1. CREATE TABLE users (id SERIAL, email VARCHAR(255));
2. CREATE INDEX idx_users_email ON users(email);
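
A sketch of that sort over the DependencyGraph above, using Kahn's algorithm (assumes every dependency is itself a tracked node, with g.edges mapping each object to its dependents as built by addNode):

import "fmt"

// topoSort emits objects with all their dependencies first.
func (g *DependencyGraph) topoSort() ([]string, error) {
    // Count unmet dependencies for every object.
    remaining := make(map[string]int, len(g.nodes))
    var queue, order []string
    for name, node := range g.nodes {
        remaining[name] = len(node.deps)
        if len(node.deps) == 0 {
            queue = append(queue, name) // no prerequisites: ready now
        }
    }
    for len(queue) > 0 {
        n := queue[0]
        queue = queue[1:]
        order = append(order, n)
        // Emitting n may make the objects that depend on it ready.
        for _, dependent := range g.edges[n] {
            if remaining[dependent]--; remaining[dependent] == 0 {
                queue = append(queue, dependent)
            }
        }
    }
    if len(order) != len(g.nodes) {
        // Leftover nodes all sit on a cycle; see the next section.
        return nil, fmt.Errorf("circular dependency detected")
    }
    return order, nil
}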

Circular Dependency Detection

If a circular dependency is detected:

-- Example circular dependency
CREATE TABLE a (id SERIAL, b_id INT REFERENCES b(id));
CREATE TABLE b (id SERIAL, a_id INT REFERENCES a(id));

Engine response:

  1. Detect cycle: a → b → a
  2. Break cycle by deferring foreign keys
  3. Emit warning to user
  4. Provide recommended fix

Parallel Processing

Independent objects can be processed in parallel:

No dependencies between:
- CREATE TABLE users (...)
- CREATE TABLE products (...)
- CREATE EXTENSION vector

These can be consolidated in any order or in parallel.

4. CONSOLIDATION

Consolidation Strategies

pgsquash-engine applies different strategies based on safety level.

Strategy 1: Sequential Operations

Pattern: Multiple ALTER statements on the same table

-- Original (4 statements)
CREATE TABLE users (id SERIAL);
ALTER TABLE users ADD COLUMN email VARCHAR(255);
ALTER TABLE users ADD COLUMN name VARCHAR(100);
ALTER TABLE users ADD COLUMN created_at TIMESTAMP DEFAULT NOW();

-- Consolidated (1 statement)
CREATE TABLE users (
    id SERIAL,
    email VARCHAR(255),
    name VARCHAR(100),
    created_at TIMESTAMP DEFAULT NOW()
);

Safety check:

  • Verify no data dependencies between statements
  • Ensure column order doesn't matter
  • Confirm no intermediate reads of partial schema
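
Mechanically, this merge is an AST rewrite: each ADD COLUMN subcommand carries a ColumnDef node that can be appended to the CreateStmt's element list. A sketch over pg_query_go's protobuf types (field and enum names as generated in v5; the safety checks above still gate whether this runs):

import pg_query "github.com/pganalyze/pg_query_go/v5"

// foldAddColumns merges ALTER TABLE ... ADD COLUMN statements into the
// table's CREATE TABLE by appending their ColumnDefs.
func foldAddColumns(create *pg_query.CreateStmt, alters []*pg_query.AlterTableStmt) {
    for _, alter := range alters {
        for _, cmd := range alter.Cmds {
            c := cmd.GetAlterTableCmd()
            if c == nil || c.Subtype != pg_query.AlterTableType_AT_AddColumn {
                continue // only plain ADD COLUMN folds into CREATE TABLE
            }
            // The subcommand's Def node is the ColumnDef itself.
            create.TableElts = append(create.TableElts, c.Def)
        }
    }
}

After folding, pg_query.Deparse can turn the modified tree back into SQL.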

Strategy 2: Redundant Operations

Pattern: Create, drop, recreate cycles

-- Original (3 statements)
CREATE INDEX idx_users_email ON users(email);
DROP INDEX idx_users_email;
CREATE INDEX idx_users_email ON users(email);

-- Consolidated (1 statement)
CREATE INDEX idx_users_email ON users(email);

Safety check:

  • Verify final operation matches desired state
  • Ensure no performance implications from intermediate states
  • Check for DROP CASCADE scenarios

Strategy 3: Column Renames

Pattern: Rename chains

-- Original (3 statements)
ALTER TABLE users RENAME COLUMN name TO full_name;
ALTER TABLE users RENAME COLUMN full_name TO user_name;
ALTER TABLE users RENAME COLUMN user_name TO display_name;

-- Consolidated (1 statement)
ALTER TABLE users RENAME COLUMN name TO display_name;
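
Collapsing a chain is a pointer-chase through the tracker's rename records. A hypothetical sketch, where renames maps each old name to the name it became, in migration order:

// collapseRenames follows a rename chain from start to its final name.
func collapseRenames(renames map[string]string, start string) string {
    seen := map[string]bool{start: true}
    name := start
    for {
        next, ok := renames[name]
        if !ok || seen[next] { // end of chain, or a rename cycle
            return name
        }
        seen[next] = true
        name = next
    }
}

For the chain above, collapseRenames(renames, "name") returns "display_name", which yields the single consolidated RENAME.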

Strategy 4: Type Changes

Pattern: Multiple type alterations

-- Original (2 statements)
ALTER TABLE users ALTER COLUMN id TYPE BIGINT;
ALTER TABLE users ALTER COLUMN id SET DEFAULT gen_random_uuid();

-- Consolidated (1 statement - if safe)
ALTER TABLE users
    ALTER COLUMN id TYPE BIGINT,
    ALTER COLUMN id SET DEFAULT gen_random_uuid();

Safety check (CRITICAL):

  • Type changes can cause data loss
  • Verify data compatibility
  • Check for dependent objects (indexes, foreign keys)
  • Consider performance impact

Safety Level Impact

Strategy              | Paranoid          | Conservative       | Standard           | Aggressive
----------------------|-------------------|--------------------|--------------------|---------------------------
Sequential Operations | ☑ Same table only | ☑ Same table only  | ☑ Cross-table      | ☑ Full merge
Redundant Operations  | ☑ Exact matches   | ☑ Semantic matches | ☑ All redundant    | ☑ All + optimizations
Rename Chains         | ☒ Keep separate   | ☑ Direct chains    | ☑ All chains       | ☑ All + dead code removal
Type Changes          | ☒ Never merge     | ☑ Safe types only  | ☑ Compatible types | ☑ All (with warning)

Rule Engine

Consolidation rules are applied via a pluggable rule system:

type Rule interface {
    Name() string
    Apply(statements []*Statement) ([]*Statement, error)
    SafetyLevel() SafetyLevel
}

type ConsolidateTableAlters struct{}

func (r *ConsolidateTableAlters) Name() string { return "consolidate-table-alters" }

// SafetyStandard is assumed here as the constant for the Standard level.
func (r *ConsolidateTableAlters) SafetyLevel() SafetyLevel { return SafetyStandard }

func (r *ConsolidateTableAlters) Apply(stmts []*Statement) ([]*Statement, error) {
    // Group ALTER statements by target table
    byTable := groupByTable(stmts)

    var result []*Statement
    for _, alters := range byTable {
        // Merge the group only if every ALTER in it is mergeable
        if canMerge(alters) {
            merged := mergeAlters(alters)
            result = append(result, merged)
        } else {
            result = append(result, alters...)
        }
    }

    return result, nil
}
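
Rules can then be chained, gated by the configured safety level. A hypothetical sketch, assuming SafetyLevel is an ordered value with Paranoid lowest:

import "fmt"

func applyRules(stmts []*Statement, rules []Rule, level SafetyLevel) ([]*Statement, error) {
    for _, r := range rules {
        // Skip rules that demand a more aggressive level than configured.
        if r.SafetyLevel() > level {
            continue
        }
        next, err := r.Apply(stmts)
        if err != nil {
            return nil, fmt.Errorf("rule %q: %w", r.Name(), err)
        }
        stmts = next
    }
    return stmts, nil
}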

5. DOCKER VALIDATION

Validation Architecture

After consolidation, pgsquash-engine proves schema equivalence using PostgreSQL itself.

Process:

  1. Spin up two PostgreSQL containers
  2. Run original migrations in Container A
  3. Run consolidated migrations in Container B
  4. Export schemas from both containers
  5. Diff schemas byte-for-byte
  6. Report any differences

Schema Dumping

# Container A (original)
pg_dump --schema-only --no-owner --no-privileges db > original.sql

# Container B (consolidated)
pg_dump --schema-only --no-owner --no-privileges db > consolidated.sql

# Comparison
diff original.sql consolidated.sql
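
Inside the engine, the same comparison can be driven from Go. A sketch, assuming the two validation containers are named pgsquash-original and pgsquash-consolidated (names are illustrative):

import (
    "bytes"
    "context"
    "os/exec"
)

// dumpSchema shells out to pg_dump inside one validation container.
func dumpSchema(ctx context.Context, container string) ([]byte, error) {
    cmd := exec.CommandContext(ctx, "docker", "exec", container,
        "pg_dump", "--schema-only", "--no-owner", "--no-privileges", "db")
    return cmd.Output()
}

// schemasMatch mirrors the diff step above: the squash is valid only if
// the two schema dumps are identical.
func schemasMatch(ctx context.Context) (bool, error) {
    original, err := dumpSchema(ctx, "pgsquash-original")
    if err != nil {
        return false, err
    }
    consolidated, err := dumpSchema(ctx, "pgsquash-consolidated")
    if err != nil {
        return false, err
    }
    return bytes.Equal(original, consolidated), nil
}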

What Gets Validated

COMPLETE SCHEMA VALIDATION

Structural validation:

  • Table structures (columns, types, defaults)
  • Indexes (including partial indexes, expressions)
  • Constraints (PK, FK, CHECK, UNIQUE, NOT NULL)
  • Sequences (start values, increments)
  • Views and materialized views
  • Functions and procedures (signatures, bodies)
  • Triggers (timing, events, functions)
  • Extensions (versions, configurations)
  • Types (enums, composites, domains)
  • RLS policies (expressions, roles)

Data validation (optional):

  • Sample data insertion
  • Query result comparison
  • Constraint enforcement testing

Validation Failures

If schemas don't match:

Example failure:

VALIDATION FAILED

Difference found in table 'users':
  Original:  email VARCHAR(255) UNIQUE
  Squashed:  email VARCHAR(255)

Missing constraint: users_email_key

Recommendation: Review migration 047 for UNIQUE constraint

Engine response:

  1. Abort consolidation
  2. Report exact differences
  3. Provide fix recommendations
  4. Preserve original files (safe)

Performance Optimization

Validation can be slow for large schemas. Optimizations:

Parallel container startup:

// Start both containers simultaneously
var wg sync.WaitGroup
wg.Add(2)

go func() {
    defer wg.Done()
    startContainer("original", originalMigrations)
}()

go func() {
    defer wg.Done()
    startContainer("consolidated", consolidatedMigrations)
}()

wg.Wait()

Schema caching:

  • Cache pg_dump output for unchanged migrations
  • Invalidate cache only when migrations change
  • Speeds up repeated validations
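
One way to key such a cache is a content hash over the migration files, so any edit invalidates it automatically (a sketch, not the engine's actual scheme):

import (
    "crypto/sha256"
    "encoding/hex"
    "os"
)

// cacheKey hashes migration contents; the cached pg_dump output is
// reused only while every input file is byte-identical.
func cacheKey(files []string) (string, error) {
    h := sha256.New()
    for _, f := range files {
        data, err := os.ReadFile(f)
        if err != nil {
            return "", err
        }
        h.Write(data)
    }
    return hex.EncodeToString(h.Sum(nil)), nil
}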

Selective validation:

# Only validate changed objects
capysquash squash migrations/*.sql --validate-changed-only

STREAMING ARCHITECTURE

For large migration sets (500+ files), pgsquash-engine uses streaming to reduce memory usage.

Streaming Pipeline

Input Files → Parser → Object Tracker → Consolidator → Output Files
     ↓           ↓           ↓              ↓              ↓
   Read 50   Parse AST   Track state   Apply rules   Write batch
     ↓           ↓           ↓              ↓              ↓
   Read 50   Parse AST   Track state   Apply rules   Write batch
     ↓           ↓           ↓              ↓              ↓
    ...         ...         ...            ...            ...

Benefits:

  • Constant memory usage (~100MB regardless of migration count)
  • Handles 10,000+ migrations
  • Real-time progress updates

Enable streaming:

capysquash squash migrations/*.sql --streaming --batch-size=50
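
The batching itself reduces to a windowed loop over the file list. A hypothetical sketch, where process stands in for the parse-track-consolidate stages:

func streamMigrations(files []string, batchSize int, process func([]string) error) error {
    for start := 0; start < len(files); start += batchSize {
        end := start + batchSize
        if end > len(files) {
            end = len(files)
        }
        // Only one batch is in flight, so memory stays roughly constant
        // no matter how many migrations there are.
        if err := process(files[start:end]); err != nil {
            return err
        }
    }
    return nil
}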

PERFORMANCE BENCHMARKS

Test environment: M1 MacBook Pro, 16GB RAM, Docker Desktop

Migration set: 287 files, 15 MB total

Phase                 | Time  | Memory
----------------------|-------|------------
Parsing               | 1.2s  | 45 MB
Tracking              | 0.8s  | 32 MB
Dependency Resolution | 0.6s  | 28 MB
Consolidation         | 1.1s  | 50 MB
Validation (Docker)   | 8.5s  | 180 MB
Total                 | 12.2s | 180 MB peak

With streaming enabled: 11.8s, 95 MB peak

AI INTEGRATION (OPTIONAL)

When AI providers are configured, pgsquash-engine can perform semantic analysis.

Function Equivalency Detection

-- Migration 045
CREATE OR REPLACE FUNCTION calculate_total(a INT, b INT)
RETURNS INT AS $$
BEGIN
    RETURN a + b;
END;
$$ LANGUAGE plpgsql;

-- Migration 089
CREATE OR REPLACE FUNCTION calculate_total(a INT, b INT)
RETURNS INT AS $$
BEGIN
    RETURN a + b;  -- same logic, different formatting
END;
$$ LANGUAGE plpgsql;

AI analysis:

  • Compare function bodies semantically
  • Detect identical logic despite formatting differences
  • Suggest removing duplicate redefinition
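
Before any model call, a cheap syntactic pre-filter can rule out the easy cases. A naive sketch (strips single-line -- comments and whitespace only; bodies that still differ go on to semantic comparison):

import "strings"

// normalizeBody removes "--" comments and collapses whitespace so that
// formatting-only differences compare equal. Naive: it does not handle
// "--" inside string literals, or block comments.
func normalizeBody(body string) string {
    var out []string
    for _, line := range strings.Split(body, "\n") {
        if i := strings.Index(line, "--"); i >= 0 {
            line = line[:i]
        }
        if line = strings.TrimSpace(line); line != "" {
            out = append(out, line)
        }
    }
    return strings.Join(out, " ")
}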

Dead Code Detection

AI can identify unused functions, triggers, or views based on:

  • No references in codebase
  • No usage in other migrations
  • Semantic understanding of dependencies

Configuration

export ANTHROPIC_API_KEY="sk-ant-..."
capysquash analyze-deep migrations/*.sql
