ARCHITECTURE DEEP DIVE

Understanding the internals of pgsquash-engine's migration consolidation system.

This guide explains how pgsquash-engine analyzes, consolidates, and validates PostgreSQL migrations with parser-grade accuracy.

SYSTEM ARCHITECTURE

HIGH-LEVEL PIPELINE

1. SQL PARSING - PostgreSQL's own parser (pg_query_go) parses every statement
2. OBJECT TRACKING - Track the lifecycle of every database object (tables, indexes, functions, etc.)
3. DEPENDENCY RESOLUTION - Build a dependency graph to determine safe operation ordering
4. CONSOLIDATION - Merge related operations based on safety level
5. DOCKER VALIDATION - Verify schema equivalence using PostgreSQL in Docker

1. SQL PARSING

PostgreSQL Parser Integration

pgsquash-engine uses pg_query_go, which wraps libpg_query, a library built from PostgreSQL's own C parser source. Every statement is parsed by the same parser PostgreSQL itself uses.

Why this matters:

  • Accuracy - no hand-written parser bugs or missed syntax edge cases
  • Future-proof - new PostgreSQL syntax is supported as libpg_query is updated
  • Battle-tested - the PostgreSQL parser has decades of production use

Parsing Flow

// Simplified example (pg_query_go v5 module path assumed)
import (
    "fmt"

    pg_query "github.com/pganalyze/pg_query_go/v5"
)

func parseSQL(sql string) (*pg_query.ParseResult, error) {
    // pg_query_go wraps libpg_query (PostgreSQL's own C parser)
    tree, err := pg_query.Parse(sql)
    if err != nil {
        return nil, fmt.Errorf("parse error: %w", err)
    }

    // Extract statement information (engine-internal helper)
    for _, stmt := range tree.Stmts {
        extractStatementInfo(stmt)
    }

    return tree, nil
}

What Gets Parsed

Supported statement types (100+):

  • DDL: CREATE, ALTER, DROP, RENAME
  • Tables, indexes, constraints, sequences
  • Functions, procedures, triggers
  • Views, materialized views
  • Extensions, schemas, types
  • RLS policies, grants, roles
  • Comments, ownership changes

Parse Tree Structure

Each statement becomes a structured Abstract Syntax Tree (AST):

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE
)

Becomes:

CreateStmt
├── TableName: "users"
├── Columns:
│   ├── ColumnDef
│   │   ├── Name: "id"
│   │   ├── Type: SERIAL
│   │   └── Constraint: PRIMARY KEY
│   └── ColumnDef
│       ├── Name: "email"
│       ├── Type: VARCHAR(255)
│       └── Constraint: UNIQUE
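
In code, such a tree can be walked through pg_query_go's generated protobuf types. A minimal sketch (v5 module path assumed; an illustration, not the engine's internal code):

package main

import (
    "fmt"

    pg_query "github.com/pganalyze/pg_query_go/v5"
)

func main() {
    result, err := pg_query.Parse(
        "CREATE TABLE users (id SERIAL PRIMARY KEY, email VARCHAR(255) UNIQUE)")
    if err != nil {
        panic(err)
    }

    // A CREATE TABLE statement surfaces as a CreateStmt node.
    create := result.Stmts[0].Stmt.GetCreateStmt()
    fmt.Println("table:", create.Relation.Relname)

    // Table elements include column definitions and table constraints.
    for _, elt := range create.TableElts {
        if col := elt.GetColumnDef(); col != nil {
            fmt.Println("column:", col.Colname)
        }
    }
}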

2. OBJECT TRACKING

Lifecycle Tracking System

pgsquash-engine maintains a complete history of every database object across all migrations.

Tracked information:

  • Creation statement and timestamp
  • All modifications (ALTER statements)
  • Dependencies on other objects
  • Final state or deletion

Object Types Tracked

SCHEMA OBJECTS

  • Tables
  • Indexes (regular, unique, partial)
  • Constraints (PK, FK, CHECK, UNIQUE)
  • Sequences
  • Views & Materialized Views

CODE OBJECTS

  • Functions
  • Procedures
  • Triggers
  • Custom Types
  • Extensions

Example Tracking

Migration 001: CREATE TABLE users (id SERIAL);
Migration 005: ALTER TABLE users ADD COLUMN email VARCHAR(255);
Migration 012: ALTER TABLE users ADD COLUMN name VARCHAR(100);
Migration 023: ALTER TABLE users DROP COLUMN name;

Tracker stores:
users {
    created: migration_001,
    columns: {
        id: { added: 001, type: SERIAL, current: true },
        email: { added: 005, type: VARCHAR(255), current: true },
        name: { added: 012, dropped: 023, current: false }
    }
}
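
One plausible shape for this state, as a hypothetical Go sketch (illustrative types, not the engine's actual ones):

// Hypothetical tracker types, for illustration only.
type ColumnState struct {
    Type      string // e.g. "VARCHAR(255)"
    AddedIn   string // migration that added the column
    DroppedIn string // empty while the column still exists
    Current   bool
}

type TableState struct {
    CreatedIn string
    Columns   map[string]*ColumnState
}

// dropColumn records a DROP COLUMN; a column with both AddedIn and
// DroppedIn set never reaches the final schema, so the consolidator
// can elide its entire lifecycle.
func (t *TableState) dropColumn(name, migration string) {
    if c, ok := t.Columns[name]; ok {
        c.DroppedIn = migration
        c.Current = false
    }
}

In the example above, the name column ends up with both AddedIn and DroppedIn set, which is exactly the signal Scenario 1 below relies on.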

Consolidation Intelligence

The tracker enables smart decisions:

Scenario 1: Column added then dropped

-- Original (2 migrations)
ALTER TABLE users ADD COLUMN temp VARCHAR(50);
ALTER TABLE users DROP COLUMN temp;

-- Consolidated (0 migrations)
-- Entire lifecycle is removed (never existed in final schema)

Scenario 2: Incremental column additions

-- Original (3 migrations)
CREATE TABLE users (id SERIAL);
ALTER TABLE users ADD COLUMN email VARCHAR(255);
ALTER TABLE users ADD COLUMN name VARCHAR(100);

-- Consolidated (1 migration)
CREATE TABLE users (
    id SERIAL,
    email VARCHAR(255),
    name VARCHAR(100)
);

3. DEPENDENCY RESOLUTION

Dependency Graph

pgsquash-engine builds a directed acyclic graph (DAG) of dependencies between database objects.

Dependency types:

  • Table → Column - Table must exist before adding columns
  • Column → Index - Column must exist before indexing it
  • Table → Foreign Key - Referenced table must exist first
  • Function → Table - Function may reference tables
  • Trigger → Function - Trigger depends on trigger function
  • View → Table - View depends on underlying tables

Graph Construction

type DependencyGraph struct {
    nodes map[string]*Node
    edges map[string][]string // object -> objects that depend on it
}

type Node struct {
    objectName string
    objectType string
    statement  *Statement
    deps       []string // objects this one depends on
}

func (g *DependencyGraph) addNode(n *Node) {
    g.nodes[n.objectName] = n
    // Record an edge from each dependency to this node so the sorter
    // can release dependents as their prerequisites are emitted.
    for _, dep := range n.deps {
        g.edges[dep] = append(g.edges[dep], n.objectName)
    }
}

func buildGraph(statements []*Statement) *DependencyGraph {
    graph := &DependencyGraph{
        nodes: make(map[string]*Node),
        edges: make(map[string][]string),
    }

    for _, stmt := range statements {
        node := extractDependencies(stmt)
        graph.addNode(node)
    }

    return graph
}

Topological Sort

To ensure safe ordering, pgsquash-engine performs topological sorting:

Original migrations (wrong order):
1. CREATE INDEX idx_users_email ON users(email);
2. CREATE TABLE users (id SERIAL, email VARCHAR(255));

Dependency graph:
idx_users_email → users (index depends on table)

Topologically sorted:
1. CREATE TABLE users (id SERIAL, email VARCHAR(255));
2. CREATE INDEX idx_users_email ON users(email);
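
A sketch of that sort over the DependencyGraph above, using Kahn's algorithm (assumes every dependency is itself a tracked node, with g.edges mapping each object to its dependents as built by addNode):

import "fmt"

// topoSort emits objects with all their dependencies first.
func (g *DependencyGraph) topoSort() ([]string, error) {
    // Count unmet dependencies for every object.
    remaining := make(map[string]int, len(g.nodes))
    var queue, order []string
    for name, node := range g.nodes {
        remaining[name] = len(node.deps)
        if len(node.deps) == 0 {
            queue = append(queue, name) // no prerequisites: ready now
        }
    }
    for len(queue) > 0 {
        n := queue[0]
        queue = queue[1:]
        order = append(order, n)
        // Emitting n may make the objects that depend on it ready.
        for _, dependent := range g.edges[n] {
            if remaining[dependent]--; remaining[dependent] == 0 {
                queue = append(queue, dependent)
            }
        }
    }
    if len(order) != len(g.nodes) {
        // Leftover nodes all sit on a cycle; see the next section.
        return nil, fmt.Errorf("circular dependency detected")
    }
    return order, nil
}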

Circular Dependency Detection

If a circular dependency is detected:

-- Example circular dependency
CREATE TABLE a (id SERIAL, b_id INT REFERENCES b(id));
CREATE TABLE b (id SERIAL, a_id INT REFERENCES a(id));

Engine response:

  1. Detect cycle: a → b → a
  2. Break cycle by deferring foreign keys
  3. Emit warning to user
  4. Provide recommended fix

Parallel Processing

Independent objects can be processed in parallel:

No dependencies between:
- CREATE TABLE users (...)
- CREATE TABLE products (...)
- CREATE EXTENSION vector

These can be consolidated in any order or in parallel.

4. CONSOLIDATION

Consolidation Strategies

pgsquash-engine applies different strategies based on safety level.

Strategy 1: Sequential Operations

Pattern: Multiple ALTER statements on the same table

-- Original (4 statements)
CREATE TABLE users (id SERIAL);
ALTER TABLE users ADD COLUMN email VARCHAR(255);
ALTER TABLE users ADD COLUMN name VARCHAR(100);
ALTER TABLE users ADD COLUMN created_at TIMESTAMP DEFAULT NOW();

-- Consolidated (1 statement)
CREATE TABLE users (
    id SERIAL,
    email VARCHAR(255),
    name VARCHAR(100),
    created_at TIMESTAMP DEFAULT NOW()
);

Safety check:

  • Verify no data dependencies between statements
  • Ensure column order doesn't matter
  • Confirm no intermediate reads of partial schema
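
Mechanically, this merge is an AST rewrite: each ADD COLUMN subcommand carries a ColumnDef node that can be appended to the CreateStmt's element list. A sketch over pg_query_go's protobuf types (field and enum names as generated in v5; the safety checks above still gate whether this runs):

import pg_query "github.com/pganalyze/pg_query_go/v5"

// foldAddColumns merges ALTER TABLE ... ADD COLUMN statements into the
// table's CREATE TABLE by appending their ColumnDefs.
func foldAddColumns(create *pg_query.CreateStmt, alters []*pg_query.AlterTableStmt) {
    for _, alter := range alters {
        for _, cmd := range alter.Cmds {
            c := cmd.GetAlterTableCmd()
            if c == nil || c.Subtype != pg_query.AlterTableType_AT_AddColumn {
                continue // only plain ADD COLUMN folds into CREATE TABLE
            }
            // The subcommand's Def node is the ColumnDef itself.
            create.TableElts = append(create.TableElts, c.Def)
        }
    }
}

After folding, pg_query.Deparse can turn the modified tree back into SQL.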

Strategy 2: Redundant Operations

Pattern: Create, drop, recreate cycles

-- Original (3 statements)
CREATE INDEX idx_users_email ON users(email);
DROP INDEX idx_users_email;
CREATE INDEX idx_users_email ON users(email);

-- Consolidated (1 statement)
CREATE INDEX idx_users_email ON users(email);

Safety check:

  • Verify final operation matches desired state
  • Ensure no performance implications from intermediate states
  • Check for DROP CASCADE scenarios

Strategy 3: Column Renames

Pattern: Rename chains

-- Original (3 statements)
ALTER TABLE users RENAME COLUMN name TO full_name;
ALTER TABLE users RENAME COLUMN full_name TO user_name;
ALTER TABLE users RENAME COLUMN user_name TO display_name;

-- Consolidated (1 statement)
ALTER TABLE users RENAME COLUMN name TO display_name;
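
Collapsing a chain is a pointer-chase through the tracker's rename records. A hypothetical sketch, where renames maps each old name to the name it became, in migration order:

// collapseRenames follows a rename chain from start to its final name.
func collapseRenames(renames map[string]string, start string) string {
    seen := map[string]bool{start: true}
    name := start
    for {
        next, ok := renames[name]
        if !ok || seen[next] { // end of chain, or a rename cycle
            return name
        }
        seen[next] = true
        name = next
    }
}

For the chain above, collapseRenames(renames, "name") returns "display_name", which yields the single consolidated RENAME.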

Strategy 4: Type Changes

Pattern: Multiple type alterations

-- Original (2 statements)
ALTER TABLE users ALTER COLUMN id TYPE BIGINT;
ALTER TABLE users ALTER COLUMN id SET DEFAULT gen_random_uuid();

-- Consolidated (1 statement - if safe)
ALTER TABLE users
    ALTER COLUMN id TYPE BIGINT,
    ALTER COLUMN id SET DEFAULT gen_random_uuid();

Safety check (CRITICAL):

  • Type changes can cause data loss
  • Verify data compatibility
  • Check for dependent objects (indexes, foreign keys)
  • Consider performance impact

Safety Level Impact

Strategy              | Paranoid          | Conservative       | Standard           | Aggressive
----------------------|-------------------|--------------------|--------------------|---------------------------
Sequential Operations | ☑ Same table only | ☑ Same table only  | ☑ Cross-table      | ☑ Full merge
Redundant Operations  | ☑ Exact matches   | ☑ Semantic matches | ☑ All redundant    | ☑ All + optimizations
Rename Chains         | ☒ Keep separate   | ☑ Direct chains    | ☑ All chains       | ☑ All + dead code removal
Type Changes          | ☒ Never merge     | ☑ Safe types only  | ☑ Compatible types | ☑ All (with warning)

Rule Engine

Consolidation rules are applied via a pluggable rule system:

type Rule interface {
    Name() string
    Apply(statements []*Statement) ([]*Statement, error)
    SafetyLevel() SafetyLevel
}

type ConsolidateTableAlters struct{}

func (r *ConsolidateTableAlters) Name() string { return "consolidate-table-alters" }

// SafetyStandard is assumed here as the constant for the Standard level.
func (r *ConsolidateTableAlters) SafetyLevel() SafetyLevel { return SafetyStandard }

func (r *ConsolidateTableAlters) Apply(stmts []*Statement) ([]*Statement, error) {
    // Group ALTER statements by target table
    byTable := groupByTable(stmts)

    var result []*Statement
    for _, alters := range byTable {
        // Merge the group only if every ALTER in it is mergeable
        if canMerge(alters) {
            merged := mergeAlters(alters)
            result = append(result, merged)
        } else {
            result = append(result, alters...)
        }
    }

    return result, nil
}
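
Rules can then be chained, gated by the configured safety level. A hypothetical sketch, assuming SafetyLevel is an ordered value with Paranoid lowest:

import "fmt"

func applyRules(stmts []*Statement, rules []Rule, level SafetyLevel) ([]*Statement, error) {
    for _, r := range rules {
        // Skip rules that demand a more aggressive level than configured.
        if r.SafetyLevel() > level {
            continue
        }
        next, err := r.Apply(stmts)
        if err != nil {
            return nil, fmt.Errorf("rule %q: %w", r.Name(), err)
        }
        stmts = next
    }
    return stmts, nil
}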

5. DOCKER VALIDATION

Validation Architecture

After consolidation, pgsquash-engine proves schema equivalence using PostgreSQL itself.

Process:

  1. Spin up two PostgreSQL containers
  2. Run original migrations in Container A
  3. Run consolidated migrations in Container B
  4. Export schemas from both containers
  5. Diff schemas byte-for-byte
  6. Report any differences

Schema Dumping

# Container A (original)
pg_dump --schema-only --no-owner --no-privileges db > original.sql

# Container B (consolidated)
pg_dump --schema-only --no-owner --no-privileges db > consolidated.sql

# Comparison
diff original.sql consolidated.sql
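
Inside the engine, the same comparison can be driven from Go. A sketch, assuming the two validation containers are named pgsquash-original and pgsquash-consolidated (names are illustrative):

import (
    "bytes"
    "context"
    "os/exec"
)

// dumpSchema shells out to pg_dump inside one validation container.
func dumpSchema(ctx context.Context, container string) ([]byte, error) {
    cmd := exec.CommandContext(ctx, "docker", "exec", container,
        "pg_dump", "--schema-only", "--no-owner", "--no-privileges", "db")
    return cmd.Output()
}

// schemasMatch mirrors the diff step above: the squash is valid only if
// the two schema dumps are identical.
func schemasMatch(ctx context.Context) (bool, error) {
    original, err := dumpSchema(ctx, "pgsquash-original")
    if err != nil {
        return false, err
    }
    consolidated, err := dumpSchema(ctx, "pgsquash-consolidated")
    if err != nil {
        return false, err
    }
    return bytes.Equal(original, consolidated), nil
}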

What Gets Validated

COMPLETE SCHEMA VALIDATION

Structural validation:

  • Table structures (columns, types, defaults)
  • Indexes (including partial indexes, expressions)
  • Constraints (PK, FK, CHECK, UNIQUE, NOT NULL)
  • Sequences (start values, increments)
  • Views and materialized views
  • Functions and procedures (signatures, bodies)
  • Triggers (timing, events, functions)
  • Extensions (versions, configurations)
  • Types (enums, composites, domains)
  • RLS policies (expressions, roles)

Data validation (optional):

  • Sample data insertion
  • Query result comparison
  • Constraint enforcement testing

Validation Failures

If schemas don't match:

Example failure:

VALIDATION FAILED

Difference found in table 'users':
  Original:  email VARCHAR(255) UNIQUE
  Squashed:  email VARCHAR(255)

Missing constraint: users_email_key

Recommendation: Review migration 047 for UNIQUE constraint

Engine response:

  1. Abort consolidation
  2. Report exact differences
  3. Provide fix recommendations
  4. Preserve original files (safe)

Performance Optimization

Validation can be slow for large schemas. Optimizations:

Parallel container startup:

// Start both containers simultaneously
var wg sync.WaitGroup
wg.Add(2)

go func() {
    defer wg.Done()
    startContainer("original", originalMigrations)
}()

go func() {
    defer wg.Done()
    startContainer("consolidated", consolidatedMigrations)
}()

wg.Wait()

Schema caching:

  • Cache pg_dump output for unchanged migrations
  • Invalidate cache only when migrations change
  • Speeds up repeated validations
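
One way to key such a cache is a content hash over the migration files, so any edit invalidates it automatically (a sketch, not the engine's actual scheme):

import (
    "crypto/sha256"
    "encoding/hex"
    "os"
)

// cacheKey hashes migration contents; the cached pg_dump output is
// reused only while every input file is byte-identical.
func cacheKey(files []string) (string, error) {
    h := sha256.New()
    for _, f := range files {
        data, err := os.ReadFile(f)
        if err != nil {
            return "", err
        }
        h.Write(data)
    }
    return hex.EncodeToString(h.Sum(nil)), nil
}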

Selective validation:

# Only validate changed objects
capysquash squash migrations/*.sql --validate-changed-only

STREAMING ARCHITECTURE

For large migration sets (500+ files), pgsquash-engine uses streaming to reduce memory usage.

Streaming Pipeline

Input Files → Parser → Object Tracker → Consolidator → Output Files
     ↓           ↓           ↓              ↓              ↓
   Read 50   Parse AST   Track state   Apply rules   Write batch
     ↓           ↓           ↓              ↓              ↓
   Read 50   Parse AST   Track state   Apply rules   Write batch
     ↓           ↓           ↓              ↓              ↓
    ...         ...         ...            ...            ...

Benefits:

  • Constant memory usage (~100MB regardless of migration count)
  • Handles 10,000+ migrations
  • Real-time progress updates

Enable streaming:

capysquash squash migrations/*.sql --streaming --batch-size=50
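
The batching itself reduces to a windowed loop over the file list. A hypothetical sketch, where process stands in for the parse-track-consolidate stages:

func streamMigrations(files []string, batchSize int, process func([]string) error) error {
    for start := 0; start < len(files); start += batchSize {
        end := start + batchSize
        if end > len(files) {
            end = len(files)
        }
        // Only one batch is in flight, so memory stays roughly constant
        // no matter how many migrations there are.
        if err := process(files[start:end]); err != nil {
            return err
        }
    }
    return nil
}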

PERFORMANCE BENCHMARKS

Test environment: M1 MacBook Pro, 16GB RAM, Docker Desktop

Migration set: 287 files, 15 MB total

Phase                 | Time  | Memory
----------------------|-------|------------
Parsing               | 1.2s  | 45 MB
Tracking              | 0.8s  | 32 MB
Dependency Resolution | 0.6s  | 28 MB
Consolidation         | 1.1s  | 50 MB
Validation (Docker)   | 8.5s  | 180 MB
Total                 | 12.2s | 180 MB peak

With streaming enabled: 11.8s, 95 MB peak

AI INTEGRATION (OPTIONAL)

When AI providers are configured, pgsquash-engine can perform semantic analysis.

Function Equivalency Detection

-- Migration 045
CREATE OR REPLACE FUNCTION calculate_total(a INT, b INT)
RETURNS INT AS $$
BEGIN
    RETURN a + b;
END;
$$ LANGUAGE plpgsql;

-- Migration 089
CREATE OR REPLACE FUNCTION calculate_total(a INT, b INT)
RETURNS INT AS $$
BEGIN
    RETURN a + b;  -- same logic, different formatting
END;
$$ LANGUAGE plpgsql;

AI analysis:

  • Compare function bodies semantically
  • Detect identical logic despite formatting differences
  • Suggest removing duplicate redefinition
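
Before any model call, a cheap syntactic pre-filter can rule out the easy cases. A naive sketch (strips single-line -- comments and whitespace only; bodies that still differ go on to semantic comparison):

import "strings"

// normalizeBody removes "--" comments and collapses whitespace so that
// formatting-only differences compare equal. Naive: it does not handle
// "--" inside string literals, or block comments.
func normalizeBody(body string) string {
    var out []string
    for _, line := range strings.Split(body, "\n") {
        if i := strings.Index(line, "--"); i >= 0 {
            line = line[:i]
        }
        if line = strings.TrimSpace(line); line != "" {
            out = append(out, line)
        }
    }
    return strings.Join(out, " ")
}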

Dead Code Detection

AI can identify unused functions, triggers, or views based on:

  • No references in codebase
  • No usage in other migrations
  • Semantic understanding of dependencies

Configuration

export ANTHROPIC_API_KEY="sk-ant-..."
capysquash analyze-deep migrations/*.sql
