Architecture Deep Dive
How pgsquash-engine works under the hood
Understanding the internals of pgsquash-engine's migration consolidation system.
This guide explains how pgsquash-engine analyzes, consolidates, and validates PostgreSQL migrations with parser-grade accuracy.
SYSTEM ARCHITECTURE
HIGH-LEVEL PIPELINE
SQL PARSING
PostgreSQL's own parser (pg_query_go) parses every statement
OBJECT TRACKING
Track lifecycle of every database object (tables, indexes, functions, etc.)
DEPENDENCY RESOLUTION
Build dependency graph to determine safe operation ordering
CONSOLIDATION
Merge related operations based on safety level
DOCKER VALIDATION
Verify schema equivalence using PostgreSQL in Docker
1. SQL PARSING
PostgreSQL Parser Integration
pgsquash-engine uses pg_query_go, which wraps PostgreSQL's actual C parser. This is the same parser that PostgreSQL itself uses.
Why this matters:
- 100% accuracy - No custom parser bugs or edge cases
- Future-proof - Automatically supports new PostgreSQL syntax
- Battle-tested - PostgreSQL parser has decades of production use
Parsing Flow
// Simplified example
func parseSQL(sql string) (*pg_query.ParseResult, error) {
    // pg_query_go wraps libpg_query (the PostgreSQL parser)
    tree, err := pg_query.Parse(sql)
    if err != nil {
        return nil, fmt.Errorf("parse error: %w", err)
    }
    // Extract statement information from each parsed statement
    for _, stmt := range tree.Stmts {
        extractStatementInfo(stmt)
    }
    return tree, nil
}
What Gets Parsed
Supported statement types (100+):
- DDL: CREATE, ALTER, DROP, RENAME
- Tables, indexes, constraints, sequences
- Functions, procedures, triggers
- Views, materialized views
- Extensions, schemas, types
- RLS policies, grants, roles
- Comments, ownership changes
Parse Tree Structure
Each statement becomes a structured Abstract Syntax Tree (AST):
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE
)
Becomes:
CreateStmt
├── TableName: "users"
├── Columns:
│   ├── ColumnDef
│   │   ├── Name: "id"
│   │   ├── Type: SERIAL
│   │   └── Constraint: PRIMARY KEY
│   └── ColumnDef
│       ├── Name: "email"
│       ├── Type: VARCHAR(255)
│       └── Constraint: UNIQUE
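As a concrete illustration, here is a minimal sketch of reading that tree with pg_query_go's protobuf-generated, nil-safe getters. tableName is a hypothetical helper, not part of pgsquash-engine:
// Hypothetical helper: pull the table name out of a parsed statement.
// The generated getters are nil-safe, so any statement that is not a
// CREATE TABLE simply falls through and returns "".
func tableName(raw *pg_query.RawStmt) string {
    if create := raw.GetStmt().GetCreateStmt(); create != nil {
        return create.GetRelation().GetRelname()
    }
    return ""
}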
2. OBJECT TRACKING
Lifecycle Tracking System
pgsquash-engine maintains a complete history of every database object across all migrations.
Tracked information:
- Creation statement and timestamp
- All modifications (ALTER statements)
- Dependencies on other objects
- Final state or deletion
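A minimal Go sketch of what this bookkeeping could look like; the type and field names here are illustrative, not pgsquash-engine's actual internals:
// Illustrative lifecycle-tracking types; the real engine's shapes may differ.
type ColumnHistory struct {
    AddedIn   string // migration that added the column, e.g. "migration_005"
    DroppedIn string // empty while the column still exists
    Type      string // SQL type as written, e.g. "VARCHAR(255)"
}

// Current reports whether the column survives into the final schema.
func (c ColumnHistory) Current() bool { return c.DroppedIn == "" }

// ObjectHistory tracks one database object across all migrations.
type ObjectHistory struct {
    CreatedIn string
    DroppedIn string
    Columns   map[string]*ColumnHistory
}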
Object Types Tracked
SCHEMA OBJECTS
- ► Tables
- ► Indexes (regular, unique, partial)
- ► Constraints (PK, FK, CHECK, UNIQUE)
- ► Sequences
- ► Views & Materialized Views
CODE OBJECTS
- ► Functions
- ► Procedures
- ► Triggers
- ► Custom Types
- ► Extensions
Example Tracking
Migration 001: CREATE TABLE users (id SERIAL);
Migration 005: ALTER TABLE users ADD COLUMN email VARCHAR(255);
Migration 012: ALTER TABLE users ADD COLUMN name VARCHAR(100);
Migration 023: ALTER TABLE users DROP COLUMN name;
Tracker stores:
users {
  created: migration_001,
  columns: {
    id:    { added: 001, type: SERIAL,       current: true },
    email: { added: 005, type: VARCHAR(255), current: true },
    name:  { added: 012, dropped: 023,       current: false }
  }
}
Consolidation Intelligence
The tracker enables smart decisions:
Scenario 1: Column added then dropped
-- Original (2 migrations)
ALTER TABLE users ADD COLUMN temp VARCHAR(50);
ALTER TABLE users DROP COLUMN temp;
-- Consolidated (0 migrations)
-- Entire lifecycle is removed (never existed in final schema)
Scenario 2: Incremental column additions
-- Original (3 migrations)
CREATE TABLE users (id SERIAL);
ALTER TABLE users ADD COLUMN email VARCHAR(255);
ALTER TABLE users ADD COLUMN name VARCHAR(100);
-- Consolidated (1 migration)
CREATE TABLE users (
    id SERIAL,
    email VARCHAR(255),
    name VARCHAR(100)
);
3. DEPENDENCY RESOLUTION
Dependency Graph
pgsquash-engine builds a directed acyclic graph (DAG) of dependencies between database objects.
Dependency types:
- Table → Column - Table must exist before adding columns
- Column → Index - Column must exist before indexing it
- Table → Foreign Key - Referenced table must exist first
- Function → Table - Function may reference tables
- Trigger → Function - Trigger depends on trigger function
- View → Table - View depends on underlying tables
Graph Construction
type DependencyGraph struct {
    nodes map[string]*Node
    edges map[string][]string
}

type Node struct {
    objectName string
    objectType string
    statement  *Statement
    deps       []string
}

func buildGraph(statements []*Statement) *DependencyGraph {
    graph := &DependencyGraph{
        nodes: make(map[string]*Node),
        edges: make(map[string][]string),
    }
    for _, stmt := range statements {
        // Each statement becomes a node carrying its extracted dependencies
        node := extractDependencies(stmt)
        graph.addNode(node)
    }
    return graph
}
Topological Sort
To ensure safe ordering, pgsquash-engine performs topological sorting:
Original migrations (wrong order):
1. CREATE INDEX idx_users_email ON users(email);
2. CREATE TABLE users (id SERIAL, email VARCHAR(255));
Dependency graph:
idx_users_email → users (index depends on table)
Topologically sorted:
1. CREATE TABLE users (id SERIAL, email VARCHAR(255));
2. CREATE INDEX idx_users_email ON users(email);
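A sketch of how such an ordering can be computed with Kahn's algorithm over the edges map from the graph above. This is the textbook algorithm, not necessarily pgsquash-engine's exact implementation:
// topoSort orders object names so every object appears after the
// objects it depends on. edges maps an object to its dependencies.
func topoSort(edges map[string][]string) ([]string, error) {
    indegree := make(map[string]int)        // unmet dependencies per object
    dependents := make(map[string][]string) // reverse edges
    for node, deps := range edges {
        if _, ok := indegree[node]; !ok {
            indegree[node] = 0
        }
        for _, dep := range deps {
            indegree[node]++
            dependents[dep] = append(dependents[dep], node)
            if _, ok := indegree[dep]; !ok {
                indegree[dep] = 0
            }
        }
    }
    var queue, order []string
    for node, d := range indegree {
        if d == 0 { // depends on nothing: safe to emit first
            queue = append(queue, node)
        }
    }
    for len(queue) > 0 {
        node := queue[0]
        queue = queue[1:]
        order = append(order, node)
        for _, next := range dependents[node] {
            indegree[next]--
            if indegree[next] == 0 {
                queue = append(queue, next)
            }
        }
    }
    if len(order) != len(indegree) {
        return nil, fmt.Errorf("circular dependency detected")
    }
    return order, nil
}
For the example above, users has no dependencies and idx_users_email depends on users, so users is emitted first. Note that the same length check doubles as cycle detection, which leads directly into the next section.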
Circular Dependency Detection
If a circular dependency is detected:
-- Example circular dependency
CREATE TABLE a (id SERIAL, b_id INT REFERENCES b(id));
CREATE TABLE b (id SERIAL, a_id INT REFERENCES a(id));
Engine response:
- Detect cycle: a → b → a
- Break cycle by deferring foreign keys (see the sketch below)
- Emit warning to user
- Provide recommended fix
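One possible shape for the cycle-breaking step, sketched against the Statement type above. splitForeignKeys and toAlterAddConstraint are assumed helpers, not confirmed engine APIs:
// breakCycle separates the foreign keys that form a cycle from their
// CREATE TABLE statements, re-adding them as ALTER TABLE ... ADD
// CONSTRAINT once every table in the cycle exists.
func breakCycle(cycle []*Statement) []*Statement {
    var creates, deferred []*Statement
    for _, stmt := range cycle {
        bare, fks := splitForeignKeys(stmt) // strip inline REFERENCES
        creates = append(creates, bare)
        for _, fk := range fks {
            deferred = append(deferred, toAlterAddConstraint(fk))
        }
    }
    // All tables first, then the constraints that reference them
    return append(creates, deferred...)
}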
Parallel Processing
Independent objects can be processed in parallel:
No dependencies between:
- CREATE TABLE users (...)
- CREATE TABLE products (...)
- CREATE EXTENSION vector
These can be consolidated in any order or in parallel.
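A sketch of that fan-out using golang.org/x/sync/errgroup; groups and consolidateGroup are hypothetical, and the real engine may schedule work differently:
// consolidateParallel runs consolidation over independent groups of
// statements concurrently; results preserves the original group order.
func consolidateParallel(groups [][]*Statement) ([][]*Statement, error) {
    results := make([][]*Statement, len(groups))
    var g errgroup.Group
    for i, grp := range groups {
        i, grp := i, grp // capture loop variables (needed before Go 1.22)
        g.Go(func() error {
            out, err := consolidateGroup(grp)
            if err != nil {
                return err
            }
            results[i] = out
            return nil
        })
    }
    if err := g.Wait(); err != nil {
        return nil, err
    }
    return results, nil
}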
4. CONSOLIDATION
Consolidation Strategies
pgsquash-engine applies different strategies based on safety level.
Strategy 1: Sequential Operations
Pattern: Multiple ALTER statements on same table
-- Original (4 statements)
CREATE TABLE users (id SERIAL);
ALTER TABLE users ADD COLUMN email VARCHAR(255);
ALTER TABLE users ADD COLUMN name VARCHAR(100);
ALTER TABLE users ADD COLUMN created_at TIMESTAMP DEFAULT NOW();
-- Consolidated (1 statement)
CREATE TABLE users (
    id SERIAL,
    email VARCHAR(255),
    name VARCHAR(100),
    created_at TIMESTAMP DEFAULT NOW()
);
Safety check:
- Verify no data dependencies between statements
- Ensure column order doesn't matter
- Confirm no intermediate reads of partial schema
Strategy 2: Redundant Operations
Pattern: Create, drop, recreate cycles
-- Original (3 statements)
CREATE INDEX idx_users_email ON users(email);
DROP INDEX idx_users_email;
CREATE INDEX idx_users_email ON users(email);
-- Consolidated (1 statement)
CREATE INDEX idx_users_email ON users(email);
Safety check:
- Verify final operation matches desired state
- Ensure no performance implications from intermediate states
- Check for DROP CASCADE scenarios
Strategy 3: Column Renames
Pattern: Rename chains
-- Original (3 statements)
ALTER TABLE users RENAME COLUMN name TO full_name;
ALTER TABLE users RENAME COLUMN full_name TO user_name;
ALTER TABLE users RENAME COLUMN user_name TO display_name;
-- Consolidated (1 statement)
ALTER TABLE users RENAME COLUMN name TO display_name;
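Collapsing a chain like this takes a single pass that tracks each column's original name; a minimal sketch (the helper name and input shape are illustrative):
// resolveRenames collapses rename chains into one original → final
// mapping, so name → full_name → user_name → display_name becomes
// a single rename: name → display_name.
func resolveRenames(renames [][2]string) map[string]string {
    origin := make(map[string]string) // current name → original name
    final := make(map[string]string)  // original name → final name
    for _, r := range renames {
        from, to := r[0], r[1]
        orig, seen := origin[from]
        if !seen {
            orig = from // first time this column is renamed
        }
        delete(origin, from)
        origin[to] = orig
        final[orig] = to
    }
    return final
}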
Strategy 4: Type Changes
Pattern: Multiple type alterations
-- Original (2 statements)
ALTER TABLE users ALTER COLUMN id TYPE BIGINT;
ALTER TABLE users ALTER COLUMN id SET DEFAULT gen_random_uuid();
-- Consolidated (1 statement - if safe)
ALTER TABLE users
    ALTER COLUMN id TYPE BIGINT,
    ALTER COLUMN id SET DEFAULT gen_random_uuid();
Safety check (CRITICAL):
- Type changes can cause data loss
- Verify data compatibility
- Check for dependent objects (indexes, foreign keys)
- Consider performance impact
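At the conservative levels, one way to implement the "safe types only" check is a whitelist of lossless widenings. This is a hypothetical sketch; the actual rules pgsquash-engine applies are not shown here:
// safeWidenings lists conversions that can never lose data. Anything
// absent from this table would be treated as unsafe to merge.
var safeWidenings = map[string][]string{
    "SMALLINT": {"INTEGER", "BIGINT"},
    "INTEGER":  {"BIGINT"},
    "VARCHAR":  {"TEXT"},
}

func isSafeTypeChange(from, to string) bool {
    for _, t := range safeWidenings[strings.ToUpper(from)] {
        if t == strings.ToUpper(to) {
            return true
        }
    }
    return false
}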
Safety Level Impact
| Strategy | Paranoid | Conservative | Standard | Aggressive |
|---|---|---|---|---|
| Sequential Operations | ☑ Same table only | ☑ Same table only | ☑ Cross-table | ☑ Full merge |
| Redundant Operations | ☑ Exact matches | ☑ Semantic matches | ☑ All redundant | ☑ All + optimizations |
| Rename Chains | ☒ Keep separate | ☑ Direct chains | ☑ All chains | ☑ All + dead code removal |
| Type Changes | ☒ Never merge | ☑ Safe types only | ☑ Compatible types | ☑ All (with warning) |
Rule Engine
Consolidation rules are applied via a pluggable rule system:
type Rule interface {
    Name() string
    Apply(statements []*Statement) ([]*Statement, error)
    SafetyLevel() SafetyLevel
}

type ConsolidateTableAlters struct{}
func (r *ConsolidateTableAlters) Apply(stmts []*Statement) ([]*Statement, error) {
    // Group ALTER statements by table
    byTable := groupByTable(stmts)
    var result []*Statement
    for _, alters := range byTable {
        // Merge only when every ALTER in the group can be combined safely
        if canMerge(alters) {
            merged := mergeAlters(alters)
            result = append(result, merged)
        } else {
            result = append(result, alters...)
        }
    }
    return result, nil
}
5. DOCKER VALIDATION
Validation Architecture
After consolidation, pgsquash-engine proves schema equivalence using PostgreSQL itself.
Process:
- Spin up two PostgreSQL containers
- Run original migrations in Container A
- Run consolidated migrations in Container B
- Export schemas from both containers
- Diff schemas byte-for-byte
- Report any differences
Schema Dumping
# Container A (original)
pg_dump --schema-only --no-owner --no-privileges db > original.sql
# Container B (consolidated)
pg_dump --schema-only --no-owner --no-privileges db > consolidated.sql
# Comparison
diff original.sql consolidated.sql
What Gets Validated
COMPLETE SCHEMA VALIDATION
Structural validation:
- Table structures (columns, types, defaults)
- Indexes (including partial indexes, expressions)
- Constraints (PK, FK, CHECK, UNIQUE, NOT NULL)
- Sequences (start values, increments)
- Views and materialized views
- Functions and procedures (signatures, bodies)
- Triggers (timing, events, functions)
- Extensions (versions, configurations)
- Types (enums, composites, domains)
- RLS policies (expressions, roles)
Data validation (optional):
- Sample data insertion
- Query result comparison
- Constraint enforcement testing
Validation Failures
If schemas don't match:
Example failure:
VALIDATION FAILED
Difference found in table 'users':
Original: email VARCHAR(255) UNIQUE
Squashed: email VARCHAR(255)
Missing constraint: users_email_key
Recommendation: Review migration 047 for UNIQUE constraint
Engine response:
- Abort consolidation
- Report exact differences
- Provide fix recommendations
- Preserve original files (safe)
Performance Optimization
Validation can be slow for large schemas. Optimizations:
Parallel container startup:
// Start both containers simultaneously
var wg sync.WaitGroup
wg.Add(2)
go func() {
    defer wg.Done()
    startContainer("original", originalMigrations)
}()
go func() {
    defer wg.Done()
    startContainer("consolidated", consolidatedMigrations)
}()
wg.Wait()
Schema caching:
- Cache pg_dump output for unchanged migrations
- Invalidate cache only when migrations change
- Speeds up repeated validations
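A sketch of one way to key such a cache, hashing the ordered migration paths and contents so any change invalidates the entry (cacheKey is a hypothetical helper):
// cacheKey derives a cache key from the ordered migration files; if
// nothing changed, a previously captured pg_dump output can be reused.
func cacheKey(migrationPaths []string) (string, error) {
    h := sha256.New()
    for _, p := range migrationPaths {
        data, err := os.ReadFile(p)
        if err != nil {
            return "", err
        }
        h.Write([]byte(p)) // include the path so reordering invalidates
        h.Write(data)
    }
    return hex.EncodeToString(h.Sum(nil)), nil
}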
Selective validation:
# Only validate changed objects
capysquash squash migrations/*.sql --validate-changed-only
STREAMING ARCHITECTURE
For large migration sets (500+ files), pgsquash-engine uses streaming to reduce memory usage.
Streaming Pipeline
Input Files → Parser → Object Tracker → Consolidator → Output Files
     ↓           ↓           ↓               ↓              ↓
  Read 50    Parse AST   Track state    Apply rules    Write batch
     ↓           ↓           ↓               ↓              ↓
  Read 50    Parse AST   Track state    Apply rules    Write batch
     ↓           ↓           ↓               ↓              ↓
    ...         ...         ...             ...            ...
Benefits:
- Constant memory usage (~100MB regardless of migration count)
- Handles 10,000+ migrations
- Real-time progress updates
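In Go terms, the pipeline can be as simple as a channel of fixed-size batches; a minimal sketch assuming a processBatch helper that runs parse → track → consolidate → write for one batch:
// streamMigrations feeds file paths through the pipeline in fixed-size
// batches, so memory use is bounded by the batch size, not the file count.
func streamMigrations(paths []string, batchSize int) error {
    batches := make(chan []string)
    go func() {
        defer close(batches)
        for start := 0; start < len(paths); start += batchSize {
            end := start + batchSize
            if end > len(paths) {
                end = len(paths)
            }
            batches <- paths[start:end]
        }
    }()
    for batch := range batches {
        if err := processBatch(batch); err != nil {
            return err
        }
    }
    return nil
}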
Enable streaming:
capysquash squash migrations/*.sql --streaming --batch-size=50
PERFORMANCE BENCHMARKS
Test environment: M1 MacBook Pro, 16GB RAM, Docker Desktop
Migration set: 287 files, 15 MB total
| Phase | Time | Memory |
|---|---|---|
| Parsing | 1.2s | 45 MB |
| Tracking | 0.8s | 32 MB |
| Dependency Resolution | 0.6s | 28 MB |
| Consolidation | 1.1s | 50 MB |
| Validation (Docker) | 8.5s | 180 MB |
| Total | 12.2s | 180 MB peak |
With streaming enabled: 11.8s, 95 MB peak
AI INTEGRATION (OPTIONAL)
When AI providers are configured, pgsquash-engine can perform semantic analysis.
Function Equivalency Detection
-- Migration 045
CREATE OR REPLACE FUNCTION calculate_total(a INT, b INT)
RETURNS INT AS $$
BEGIN
    RETURN a + b;
END;
$$ LANGUAGE plpgsql;
-- Migration 089
CREATE OR REPLACE FUNCTION calculate_total(a INT, b INT)
RETURNS INT AS $$
BEGIN
    RETURN a + b; -- same logic, different formatting
END;
$$ LANGUAGE plpgsql;
AI analysis:
- Compare function bodies semantically
- Detect identical logic despite formatting differences
- Suggest removing duplicate redefinition
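Much of this can be pre-filtered without an AI call at all. A hedged sketch of a cheap normalization pass that catches formatting-only differences like the one above; a real implementation would also need to respect string literals:
// normalizeBody strips line comments and collapses whitespace so that
// formatting-only differences compare equal. Naive on purpose: it does
// not handle "--" appearing inside string literals.
func normalizeBody(body string) string {
    var out []string
    for _, line := range strings.Split(body, "\n") {
        if i := strings.Index(line, "--"); i >= 0 {
            line = line[:i] // drop line comments
        }
        line = strings.Join(strings.Fields(line), " ")
        if line != "" {
            out = append(out, line)
        }
    }
    return strings.Join(out, "\n")
}

// Only bodies that still differ after normalization need the
// (optional) semantic comparison by an AI provider.
func likelyEquivalent(a, b string) bool {
    return normalizeBody(a) == normalizeBody(b)
}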
Dead Code Detection
AI can identify unused functions, triggers, or views based on:
- No references in codebase
- No usage in other migrations
- Semantic understanding of dependencies
Configuration
export ANTHROPIC_API_KEY="sk-ant-..."
capysquash analyze-deep migrations/*.sql