
MIGRATION ANALYSIS

Learn how CAPYSQUASH analyzes your migrations to safely consolidate them.

OVERVIEW

capysquash uses a sophisticated multi-phase analysis engine to understand your migrations, track object lifecycles, and identify safe consolidation opportunities.

HOW IT WORKS

1. Parsing: parse SQL using PostgreSQL's official parser
2. Tracking: track object lifecycles and changes over time
3. Dependency analysis: build a dependency graph and detect conflicts
4. Consolidation: identify safe consolidation opportunities
5. Generation: generate optimized, dependency-sorted migrations

PHASE 1: PARSING

PostgreSQL Parser

capysquash uses PostgreSQL's official parser (via pg_query_go) for 100% accurate SQL analysis.

Why this matters:

  • ☑ Understands ALL PostgreSQL syntax
  • ☑ Handles complex queries, CTEs, window functions
  • ☑ Parses syntax used by third-party extensions (PostGIS, pgvector, etc.)
  • ☑ Same parser PostgreSQL itself uses

Example:

-- capysquash can parse complex statements like:
CREATE TABLE users (
  id uuid DEFAULT gen_random_uuid() PRIMARY KEY,
  data jsonb NOT NULL,
  search tsvector GENERATED ALWAYS AS (
    to_tsvector('english', data->>'name')
  ) STORED
);

Statement Extraction

From each migration file, capysquash extracts:

📊 METADATA

  • ► Object type (table, index, function)
  • ► Object name and schema
  • ► Operation type (CREATE, ALTER, DROP)
  • ► Source file and line number

🔗 RELATIONSHIPS

  • ► Foreign key references
  • ► Function dependencies
  • ► Schema dependencies
  • ► Extension requirements
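The extraction step above can be sketched as follows. This is an illustrative Python sketch only: capysquash uses PostgreSQL's actual parser rather than regular expressions, and the function and field names here are hypothetical, chosen to mirror the metadata listed above.

```python
import re

# Illustrative only: capysquash uses PostgreSQL's real parser, not regexes.
# This sketch merely shows the shape of the metadata extracted per statement.
STMT_RE = re.compile(
    r"^\s*(CREATE|ALTER|DROP)\s+(TABLE|INDEX|FUNCTION)\s+"
    r"(?:IF\s+(?:NOT\s+)?EXISTS\s+)?(?:(\w+)\.)?(\w+)",
    re.IGNORECASE,
)

def extract_metadata(sql: str, source_file: str) -> list[dict]:
    """Return operation/object/schema/name metadata for each statement."""
    rows = []
    for stmt in sql.split(";"):
        if m := STMT_RE.match(stmt):
            op, obj_type, schema, name = m.groups()
            rows.append({
                "operation": op.upper(),
                "object_type": obj_type.lower(),
                "schema": schema or "public",
                "name": name,
                "source": source_file,
            })
    return rows
```

Each row carries enough context (source file, object identity, operation) for the later tracking and dependency phases to correlate statements.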

PHASE 2: OBJECT TRACKING

Lifecycle Tracking

capysquash tracks how each database object evolves:

Table: users
├─ Migration 001: CREATE TABLE users (id INT)
├─ Migration 003: ALTER TABLE users ADD COLUMN email TEXT
├─ Migration 007: ALTER TABLE users ADD COLUMN name TEXT
├─ Migration 012: ALTER TABLE users ALTER COLUMN email SET NOT NULL
└─ Final state: users (id INT, email TEXT NOT NULL, name TEXT)

Consolidation opportunity: Merge into single CREATE TABLE statement.

State Management

For each object, capysquash tracks:

Property           Description           Example
Current State      Latest definition     users (id, email, name)
Change History     All modifications     4 ALTERs tracked
Dependencies       What it references    FK to profiles table
Dependents         What references it    Index idx_users_email
Lifecycle Events   CREATE, ALTER, DROP   Created in 001, altered 3x

Example Tracking

Input migrations:

-- 001_init.sql
CREATE TABLE posts (
  id serial PRIMARY KEY
);

-- 003_add_title.sql
ALTER TABLE posts ADD COLUMN title text;

-- 007_add_author.sql
ALTER TABLE posts ADD COLUMN author_id int;

-- 012_add_fk.sql
ALTER TABLE posts
  ADD CONSTRAINT fk_author
  FOREIGN KEY (author_id) REFERENCES users(id);

Tracked state:

Table: posts
  Created: 001_init.sql
  Columns added: title (003), author_id (007)
  Constraints added: fk_author (012)
  Dependencies: users table
  Final statement: CREATE TABLE posts (
    id serial PRIMARY KEY,
    title text,
    author_id int,
    CONSTRAINT fk_author FOREIGN KEY (author_id) REFERENCES users(id)
  );
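The tracked state above could be modeled with a structure like this. A minimal Python sketch: the class, method, and field names are hypothetical, not capysquash's actual internals.

```python
from dataclasses import dataclass, field

# Hypothetical model of per-object state; names are illustrative only.
@dataclass
class TableState:
    name: str
    created_in: str                                   # creating migration
    columns: dict = field(default_factory=dict)       # column -> SQL type
    constraints: dict = field(default_factory=dict)   # name -> (column, ref)
    depends_on: set = field(default_factory=set)      # referenced tables
    history: list = field(default_factory=list)       # lifecycle events

    def add_column(self, migration: str, column: str, sql_type: str):
        self.columns[column] = sql_type
        self.history.append((migration, f"ADD COLUMN {column}"))

    def add_fk(self, migration: str, name: str, column: str, ref_table: str):
        self.constraints[name] = (column, ref_table)
        self.depends_on.add(ref_table)
        self.history.append((migration, f"ADD CONSTRAINT {name}"))

# Replaying the posts example above:
posts = TableState("posts", created_in="001_init.sql",
                   columns={"id": "serial PRIMARY KEY"})
posts.add_column("003_add_title.sql", "title", "text")
posts.add_column("007_add_author.sql", "author_id", "int")
posts.add_fk("012_add_fk.sql", "fk_author", "author_id", "users")
```

After the replay, `posts.columns` holds the final column set, `posts.depends_on` contains `users`, and `posts.history` records each lifecycle event, which is all the generator needs to emit a single consolidated CREATE TABLE.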

PHASE 3: DEPENDENCY ANALYSIS

Dependency Graph

capysquash builds a directed acyclic graph (DAG) of dependencies:

Extensions
    ├─→ Schemas
    │       │
    │       ├─→ Custom Types (ENUMs, Composite)
    │       │       │
    │       │       └─→ Tables
    │       │               │
    │       │               ├─→ Constraints
    │       │               │
    │       │               ├─→ Indexes
    │       │               │
    │       │               └─→ Functions/Triggers
    │       │
    │       └─→ RLS Policies
    └─→ Data (INSERTs)

Dependency Types

DEPENDENCY CATEGORIES

1. Schema dependencies: a table must exist before an FK constraint references it (users → posts.fk_author)

2. Type dependencies: an ENUM must exist before a column uses it (CREATE TYPE status → users.status)

3. Function dependencies: a function must exist before a trigger calls it (handle_updated_at() → trigger on users)

4. Extension dependencies: an extension must be installed before its features are used (uuid-ossp → uuid_generate_v4())

5. Data dependencies: a table must exist before rows are inserted into it (CREATE TABLE → INSERT INTO)

Circular Dependency Detection

capysquash detects and resolves circular dependencies:

Example problem:

-- Table A references B
CREATE TABLE a (b_id INT REFERENCES b(id));

-- Table B references A
CREATE TABLE b (a_id INT REFERENCES a(id));

Solution:

-- Create tables without FKs first
CREATE TABLE a (b_id INT);
CREATE TABLE b (a_id INT);

-- Add FKs after both exist
ALTER TABLE a ADD CONSTRAINT fk_b FOREIGN KEY (b_id) REFERENCES b(id);
ALTER TABLE b ADD CONSTRAINT fk_a FOREIGN KEY (a_id) REFERENCES a(id);
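Cycle detection of this kind is commonly done with Kahn's algorithm: nodes that never reach in-degree zero are part of a cycle, and their FK constraints can be split out into later ALTER statements, as shown above. A minimal sketch, assuming a simple edge-list representation (not capysquash's actual code):

```python
from collections import deque

def topo_sort(nodes, edges):
    """Kahn's algorithm: order `nodes` so every dependency in `edges`
    (dep -> dependent) comes first; leftover nodes belong to a cycle."""
    indegree = {n: 0 for n in nodes}
    adj = {n: [] for n in nodes}
    for dep, dependent in edges:
        adj[dep].append(dependent)
        indegree[dependent] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in adj[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    # Nodes never emitted are on a cycle: emit them without FKs,
    # then add the constraints via ALTER TABLE afterwards.
    cyclic = [n for n in nodes if n not in order]
    return order, cyclic
```

For the `a`/`b` example above, both tables come back in `cyclic`, which signals the generator to create them bare and attach the foreign keys afterwards.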

PHASE 4: CONSOLIDATION DETECTION

Consolidation Opportunities

capysquash identifies multiple types of consolidation:

☑ SAFE CONSOLIDATIONS

  • ► CREATE + ALTER → Single CREATE
  • ► Multiple ALTERs → Combined ALTER
  • ► Duplicate indexes → Deduplicated
  • ► Identical functions → Single definition
  • ► Multiple GRANTs → Consolidated

⚠️ ADVANCED (Higher safety levels)

  • ► DROP → CREATE cycles → Single CREATE
  • ► Unused columns → Removed
  • ► Dead code → Eliminated
  • ► RLS policies → Optimized

Example: CREATE + ALTER Consolidation

Before (3 migrations):

-- Migration 001
CREATE TABLE users (id INT PRIMARY KEY);

-- Migration 005
ALTER TABLE users ADD COLUMN email TEXT;

-- Migration 012
ALTER TABLE users ADD COLUMN name TEXT NOT NULL;

After (consolidated):

CREATE TABLE users (
  id INT PRIMARY KEY,
  email TEXT,
  name TEXT NOT NULL
);

Safety: ☑ Safe - final schema is identical
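A string-level sketch of this fold, for illustration only: the real analysis works on parse trees, and this sketch assumes CREATE precedes its ALTERs and ignores commas nested inside type parameters such as numeric(10,2).

```python
import re

# Illustrative sketch of CREATE + ALTER consolidation; capysquash operates
# on parse trees, not regexes over SQL text.
CREATE_RE = re.compile(r"CREATE TABLE (\w+)\s*\((.*)\)\s*;?\s*$", re.S)
ALTER_ADD_RE = re.compile(r"ALTER TABLE (\w+) ADD COLUMN (.+?);?\s*$", re.S)

def consolidate(statements):
    """Fold `ALTER TABLE ... ADD COLUMN` into the matching CREATE TABLE."""
    tables = {}   # table name -> list of column definitions
    order = []    # preserve original CREATE order
    for stmt in statements:
        if m := CREATE_RE.match(stmt.strip()):
            name, cols = m.groups()
            tables[name] = [c.strip() for c in cols.split(",")]
            order.append(name)
        elif m := ALTER_ADD_RE.match(stmt.strip()):
            name, col = m.groups()
            tables[name].append(col.strip())
    return [
        f"CREATE TABLE {name} (\n  " + ",\n  ".join(tables[name]) + "\n);"
        for name in order
    ]
```

Running it over the three migrations above yields exactly the consolidated CREATE TABLE shown, which is why this class of merge is considered safe: the final schema is identical.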

PHASE 5: AI-POWERED ANALYSIS (Optional)

When enabled, AI analysis provides semantic understanding:

What AI Analyzes

  1. Function Similarity

    • Identifies functionally identical but differently formatted code
    • Detects semantic duplicates
  2. Dead Code Detection

    • Finds unused functions
    • Identifies orphaned triggers
    • Detects unreferenced types
  3. Optimization Suggestions

    • Better index strategies
    • RLS policy improvements
    • Query performance hints

Example AI Insight

-- AI detects these are semantically identical:
CREATE FUNCTION update_timestamp_v1()
RETURNS TRIGGER AS $$
BEGIN
  NEW.updated_at = NOW();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE FUNCTION update_modified_at()
RETURNS TRIGGER AS $$
BEGIN
  NEW.updated_at = CURRENT_TIMESTAMP;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

AI suggests: Consolidate to single function (NOW() and CURRENT_TIMESTAMP are equivalent).

PHASE 6: OUTPUT GENERATION

Dependency-Sorted Output

capysquash generates migrations in dependency order:

Organized Categories

Category            Contents                        Why This Order
Extensions          CREATE EXTENSION                Required by other objects
Schemas & Types     CREATE SCHEMA, TYPE, ENUM       Referenced by tables
Tables              CREATE TABLE (structure only)   Core schema
Constraints         FK, CHECK, UNIQUE               After tables exist
Indexes             CREATE INDEX                    After constraints
Functions/Triggers  Functions, triggers, views      After tables
RLS/Security        RLS policies, GRANTs            After functions
Data                INSERT statements               After everything
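This ordering can be expressed as a simple bucketing pass. A hypothetical sketch: the category names are assumptions mirroring the table above, not capysquash's real identifiers.

```python
# Hypothetical sketch of dependency-ordered output: bucket each statement
# by category, then emit the buckets in the fixed safe order shown above.
CATEGORY_ORDER = [
    "extensions", "schemas_types", "tables", "constraints",
    "indexes", "functions_triggers", "rls_security", "data",
]

def sort_output(statements):
    """statements: list of (category, sql) pairs. Emit in safe category
    order, preserving the original order within each category."""
    buckets = {c: [] for c in CATEGORY_ORDER}
    for category, sql in statements:
        buckets[category].append(sql)
    return [sql for c in CATEGORY_ORDER for sql in buckets[c]]
```

Keeping the within-category order stable means two statements that were already correctly ordered relative to each other never get swapped.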

SAFETY GUARANTEES

capysquash provides multiple safety guarantees:

SAFETY MECHANISMS

☑ PARSE VALIDATION

Every statement parsed with PostgreSQL's parser before and after

☑ DEPENDENCY VERIFICATION

All dependencies resolved before output generation

☑ SCHEMA COMPARISON (with Docker)

Byte-for-byte comparison of resulting schemas

☑ ROLLBACK SUPPORT

Keep originals, test consolidated versions separately

☑ DRY RUN MODE

Preview changes without modifying files

PERFORMANCE OPTIMIZATIONS

For large migration sets (500+ files):

Streaming Mode

capysquash squash migrations/*.sql --streaming --memory-limit 512

Benefits:

  • Processes files in batches
  • Reduced memory footprint
  • Handles massive migration sets

Parallel Processing

capysquash squash migrations/*.sql --parallel 8

Benefits:

  • Parse files concurrently
  • Faster for large projects
  • Utilizes multiple CPU cores
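Conceptually, the parallel phase fans file parsing out across workers. An illustrative sketch only: a real implementation would invoke the PostgreSQL parser per file, where this placeholder just counts statements.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch of the --parallel fan-out; the "parse" here is a
# stand-in statement count, not a real parser call.
def parse_one(item):
    name, sql = item
    return name, sql.count(";")   # placeholder for real parse work

def parse_all(migrations, workers=8):
    """migrations: iterable of (filename, sql) pairs, parsed concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(parse_one, migrations))
```

Because each file parses independently, the work is embarrassingly parallel and scales with available cores until disk I/O dominates.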

ANALYSIS METRICS

When you run analysis, you see:

📊 ANALYSIS RESULTS

Files: 156 migrations
Statements: 2,847 total
Size: 12.4 MB

🎯 CONSOLIDATION POTENTIAL: 78%

Opportunities detected:
  ► CREATE + ALTER pairs: 89 instances
  ► Duplicate indexes: 23 found
  ► Duplicate functions: 12 found
  ► RLS policies: 34 can be optimized
  ► Dead code: 8 unused functions

Estimated output:
  ► Files: 34 (78% reduction)
  ► Statements: 1,124 (61% reduction)
  ► Size: ~4.2 MB (66% reduction)
