
MIGRATION ANALYSIS

Learn how CAPYSQUASH analyzes your migrations to safely consolidate them.

OVERVIEW

capysquash uses a sophisticated multi-phase analysis engine to understand your migrations, track object lifecycles, and identify safe consolidation opportunities.

HOW IT WORKS

1. Parsing: parse SQL using PostgreSQL's official parser
2. Tracking: track object lifecycles and changes over time
3. Dependency analysis: build a dependency graph and detect conflicts
4. Consolidation: identify safe consolidation opportunities
5. Generation: generate optimized, dependency-sorted migrations

PHASE 1: PARSING

PostgreSQL Parser

capysquash uses PostgreSQL's official parser (via pg_query_go) for 100% accurate SQL analysis.

Why this matters:

  • ☑ Understands ALL PostgreSQL syntax
  • ☑ Handles complex queries, CTEs, window functions
  • ☑ Parses syntax used by third-party extensions (PostGIS, pgvector, etc.)
  • ☑ Same parser PostgreSQL itself uses

Example:

-- capysquash can parse complex statements like:
CREATE TABLE users (
  id uuid DEFAULT gen_random_uuid() PRIMARY KEY,
  data jsonb NOT NULL,
  search tsvector GENERATED ALWAYS AS (
    to_tsvector('english', data->>'name')
  ) STORED
);

Statement Extraction

From each migration file, capysquash extracts:

📊 METADATA

  • ► Object type (table, index, function)
  • ► Object name and schema
  • ► Operation type (CREATE, ALTER, DROP)
  • ► Source file and line number

🔗 RELATIONSHIPS

  • ► Foreign key references
  • ► Function dependencies
  • ► Schema dependencies
  • ► Extension requirements
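The extraction step above can be sketched as follows. This is an illustrative Python sketch only: capysquash uses PostgreSQL's actual parser rather than regular expressions, and the function and field names here are hypothetical, chosen to mirror the metadata listed above.

```python
import re

# Illustrative only: capysquash uses PostgreSQL's real parser, not regexes.
# This sketch merely shows the shape of the metadata extracted per statement.
STMT_RE = re.compile(
    r"^\s*(CREATE|ALTER|DROP)\s+(TABLE|INDEX|FUNCTION)\s+"
    r"(?:IF\s+(?:NOT\s+)?EXISTS\s+)?(?:(\w+)\.)?(\w+)",
    re.IGNORECASE,
)

def extract_metadata(sql: str, source_file: str) -> list[dict]:
    """Return operation/object/schema/name metadata for each statement."""
    rows = []
    for stmt in sql.split(";"):
        if m := STMT_RE.match(stmt):
            op, obj_type, schema, name = m.groups()
            rows.append({
                "operation": op.upper(),
                "object_type": obj_type.lower(),
                "schema": schema or "public",
                "name": name,
                "source": source_file,
            })
    return rows
```

Each row carries enough context (source file, object identity, operation) for the later tracking and dependency phases to correlate statements.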

PHASE 2: OBJECT TRACKING

Lifecycle Tracking

capysquash tracks how each database object evolves:

Table: users
├─ Migration 001: CREATE TABLE users (id INT)
├─ Migration 003: ALTER TABLE users ADD COLUMN email TEXT
├─ Migration 007: ALTER TABLE users ADD COLUMN name TEXT
├─ Migration 012: ALTER TABLE users ALTER COLUMN email SET NOT NULL
└─ Final state: users (id INT, email TEXT NOT NULL, name TEXT)

Consolidation opportunity: Merge into single CREATE TABLE statement.

State Management

For each object, capysquash tracks:

Property           Description           Example
Current State      Latest definition     users (id, email, name)
Change History     All modifications     4 ALTERs tracked
Dependencies       What it references    FK to profiles table
Dependents         What references it    Index idx_users_email
Lifecycle Events   CREATE, ALTER, DROP   Created in 001, altered 3x

Example Tracking

Input migrations:

-- 001_init.sql
CREATE TABLE posts (
  id serial PRIMARY KEY
);

-- 003_add_title.sql
ALTER TABLE posts ADD COLUMN title text;

-- 007_add_author.sql
ALTER TABLE posts ADD COLUMN author_id int;

-- 012_add_fk.sql
ALTER TABLE posts
  ADD CONSTRAINT fk_author
  FOREIGN KEY (author_id) REFERENCES users(id);

Tracked state:

Table: posts
  Created: 001_init.sql
  Columns added: title (003), author_id (007)
  Constraints added: fk_author (012)
  Dependencies: users table
  Final statement: CREATE TABLE posts (
    id serial PRIMARY KEY,
    title text,
    author_id int,
    CONSTRAINT fk_author FOREIGN KEY (author_id) REFERENCES users(id)
  );
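The tracked state above could be modeled with a structure like this. A minimal Python sketch: the class, method, and field names are hypothetical, not capysquash's actual internals.

```python
from dataclasses import dataclass, field

# Hypothetical model of per-object state; names are illustrative only.
@dataclass
class TableState:
    name: str
    created_in: str                                   # creating migration
    columns: dict = field(default_factory=dict)       # column -> SQL type
    constraints: dict = field(default_factory=dict)   # name -> (column, ref)
    depends_on: set = field(default_factory=set)      # referenced tables
    history: list = field(default_factory=list)       # lifecycle events

    def add_column(self, migration: str, column: str, sql_type: str):
        self.columns[column] = sql_type
        self.history.append((migration, f"ADD COLUMN {column}"))

    def add_fk(self, migration: str, name: str, column: str, ref_table: str):
        self.constraints[name] = (column, ref_table)
        self.depends_on.add(ref_table)
        self.history.append((migration, f"ADD CONSTRAINT {name}"))

# Replaying the posts example above:
posts = TableState("posts", created_in="001_init.sql",
                   columns={"id": "serial PRIMARY KEY"})
posts.add_column("003_add_title.sql", "title", "text")
posts.add_column("007_add_author.sql", "author_id", "int")
posts.add_fk("012_add_fk.sql", "fk_author", "author_id", "users")
```

After the replay, `posts.columns` holds the final column set, `posts.depends_on` contains `users`, and `posts.history` records each lifecycle event, which is all the generator needs to emit a single consolidated CREATE TABLE.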

PHASE 3: DEPENDENCY ANALYSIS

Dependency Graph

capysquash builds a directed acyclic graph (DAG) of dependencies:

Extensions
    ├─→ Schemas
    │       │
    │       ├─→ Custom Types (ENUMs, Composite)
    │       │       │
    │       │       └─→ Tables
    │       │               │
    │       │               ├─→ Constraints
    │       │               │
    │       │               ├─→ Indexes
    │       │               │
    │       │               └─→ Functions/Triggers
    │       │
    │       └─→ RLS Policies
    └─→ Data (INSERTs)

Dependency Types

DEPENDENCY CATEGORIES

1. Schema dependencies: a table must exist before an FK constraint references it (users → posts.fk_author)

2. Type dependencies: an ENUM must exist before a column uses it (CREATE TYPE status → users.status)

3. Function dependencies: a function must exist before a trigger calls it (handle_updated_at() → trigger on users)

4. Extension dependencies: an extension must be installed before its features are used (uuid-ossp → uuid_generate_v4())

5. Data dependencies: a table must exist before rows are inserted into it (CREATE TABLE → INSERT INTO)

Circular Dependency Detection

capysquash detects and resolves circular dependencies:

Example problem:

-- Table A references B
CREATE TABLE a (b_id INT REFERENCES b(id));

-- Table B references A
CREATE TABLE b (a_id INT REFERENCES a(id));

Solution:

-- Create tables without FKs first
CREATE TABLE a (b_id INT);
CREATE TABLE b (a_id INT);

-- Add FKs after both exist
ALTER TABLE a ADD CONSTRAINT fk_b FOREIGN KEY (b_id) REFERENCES b(id);
ALTER TABLE b ADD CONSTRAINT fk_a FOREIGN KEY (a_id) REFERENCES a(id);
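Cycle detection of this kind is commonly done with Kahn's algorithm: nodes that never reach in-degree zero are part of a cycle, and their FK constraints can be split out into later ALTER statements, as shown above. A minimal sketch, assuming a simple edge-list representation (not capysquash's actual code):

```python
from collections import deque

def topo_sort(nodes, edges):
    """Kahn's algorithm: order `nodes` so every dependency in `edges`
    (dep -> dependent) comes first; leftover nodes belong to a cycle."""
    indegree = {n: 0 for n in nodes}
    adj = {n: [] for n in nodes}
    for dep, dependent in edges:
        adj[dep].append(dependent)
        indegree[dependent] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in adj[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    # Nodes never emitted are on a cycle: emit them without FKs,
    # then add the constraints via ALTER TABLE afterwards.
    cyclic = [n for n in nodes if n not in order]
    return order, cyclic
```

For the `a`/`b` example above, both tables come back in `cyclic`, which signals the generator to create them bare and attach the foreign keys afterwards.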

PHASE 4: CONSOLIDATION DETECTION

Consolidation Opportunities

capysquash identifies multiple types of consolidation:

☑ SAFE CONSOLIDATIONS

  • ► CREATE + ALTER → Single CREATE
  • ► Multiple ALTERs → Combined ALTER
  • ► Duplicate indexes → Deduplicated
  • ► Identical functions → Single definition
  • ► Multiple GRANTs → Consolidated

⚠️ ADVANCED (Higher safety levels)

  • ► DROP → CREATE cycles → Single CREATE
  • ► Unused columns → Removed
  • ► Dead code → Eliminated
  • ► RLS policies → Optimized

Example: CREATE + ALTER Consolidation

Before (3 migrations):

-- Migration 001
CREATE TABLE users (id INT PRIMARY KEY);

-- Migration 005
ALTER TABLE users ADD COLUMN email TEXT;

-- Migration 012
ALTER TABLE users ADD COLUMN name TEXT NOT NULL;

After (consolidated):

CREATE TABLE users (
  id INT PRIMARY KEY,
  email TEXT,
  name TEXT NOT NULL
);

Safety: ☑ Safe - final schema is identical
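A string-level sketch of this fold, for illustration only: the real analysis works on parse trees, and this sketch assumes CREATE precedes its ALTERs and ignores commas nested inside type parameters such as numeric(10,2).

```python
import re

# Illustrative sketch of CREATE + ALTER consolidation; capysquash operates
# on parse trees, not regexes over SQL text.
CREATE_RE = re.compile(r"CREATE TABLE (\w+)\s*\((.*)\)\s*;?\s*$", re.S)
ALTER_ADD_RE = re.compile(r"ALTER TABLE (\w+) ADD COLUMN (.+?);?\s*$", re.S)

def consolidate(statements):
    """Fold `ALTER TABLE ... ADD COLUMN` into the matching CREATE TABLE."""
    tables = {}   # table name -> list of column definitions
    order = []    # preserve original CREATE order
    for stmt in statements:
        if m := CREATE_RE.match(stmt.strip()):
            name, cols = m.groups()
            tables[name] = [c.strip() for c in cols.split(",")]
            order.append(name)
        elif m := ALTER_ADD_RE.match(stmt.strip()):
            name, col = m.groups()
            tables[name].append(col.strip())
    return [
        f"CREATE TABLE {name} (\n  " + ",\n  ".join(tables[name]) + "\n);"
        for name in order
    ]
```

Running it over the three migrations above yields exactly the consolidated CREATE TABLE shown, which is why this class of merge is considered safe: the final schema is identical.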

PHASE 5: AI-POWERED ANALYSIS (Optional)

When enabled, AI analysis provides semantic understanding:

What AI Analyzes

  1. Function Similarity

    • Identifies functionally identical but differently formatted code
    • Detects semantic duplicates
  2. Dead Code Detection

    • Finds unused functions
    • Identifies orphaned triggers
    • Detects unreferenced types
  3. Optimization Suggestions

    • Better index strategies
    • RLS policy improvements
    • Query performance hints

Example AI Insight

-- AI detects these are semantically identical:
CREATE FUNCTION update_timestamp_v1()
RETURNS TRIGGER AS $$
BEGIN
  NEW.updated_at = NOW();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE FUNCTION update_modified_at()
RETURNS TRIGGER AS $$
BEGIN
  NEW.updated_at = CURRENT_TIMESTAMP;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

AI suggests: Consolidate to single function (NOW() and CURRENT_TIMESTAMP are equivalent).

PHASE 6: OUTPUT GENERATION

Dependency-Sorted Output

capysquash generates migrations in dependency order:

Organized Categories

Category            Contents                        Why This Order
Extensions          CREATE EXTENSION                Required by other objects
Schemas & Types     CREATE SCHEMA, TYPE, ENUM       Referenced by tables
Tables              CREATE TABLE (structure only)   Core schema
Constraints         FK, CHECK, UNIQUE               After tables exist
Indexes             CREATE INDEX                    After constraints
Functions/Triggers  Functions, triggers, views      After tables
RLS/Security        RLS policies, GRANTs            After functions
Data                INSERT statements               After everything
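This ordering can be expressed as a simple bucketing pass. A hypothetical sketch: the category names are assumptions mirroring the table above, not capysquash's real identifiers.

```python
# Hypothetical sketch of dependency-ordered output: bucket each statement
# by category, then emit the buckets in the fixed safe order shown above.
CATEGORY_ORDER = [
    "extensions", "schemas_types", "tables", "constraints",
    "indexes", "functions_triggers", "rls_security", "data",
]

def sort_output(statements):
    """statements: list of (category, sql) pairs. Emit in safe category
    order, preserving the original order within each category."""
    buckets = {c: [] for c in CATEGORY_ORDER}
    for category, sql in statements:
        buckets[category].append(sql)
    return [sql for c in CATEGORY_ORDER for sql in buckets[c]]
```

Keeping the within-category order stable means two statements that were already correctly ordered relative to each other never get swapped.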

SAFETY GUARANTEES

capysquash provides multiple safety guarantees:

SAFETY MECHANISMS

☑ PARSE VALIDATION

Every statement parsed with PostgreSQL's parser before and after

☑ DEPENDENCY VERIFICATION

All dependencies resolved before output generation

☑ SCHEMA COMPARISON (with Docker)

Byte-for-byte comparison of resulting schemas

☑ ROLLBACK SUPPORT

Keep originals, test consolidated versions separately

☑ DRY RUN MODE

Preview changes without modifying files

PERFORMANCE OPTIMIZATIONS

For large migration sets (500+ files):

Streaming Mode

capysquash squash migrations/*.sql --streaming --memory-limit 512

Benefits:

  • Processes files in batches
  • Reduced memory footprint
  • Handles massive migration sets

Parallel Processing

capysquash squash migrations/*.sql --parallel 8

Benefits:

  • Parse files concurrently
  • Faster for large projects
  • Utilizes multiple CPU cores
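Conceptually, the parallel phase fans file parsing out across workers. An illustrative sketch only: a real implementation would invoke the PostgreSQL parser per file, where this placeholder just counts statements.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch of the --parallel fan-out; the "parse" here is a
# stand-in statement count, not a real parser call.
def parse_one(item):
    name, sql = item
    return name, sql.count(";")   # placeholder for real parse work

def parse_all(migrations, workers=8):
    """migrations: iterable of (filename, sql) pairs, parsed concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(parse_one, migrations))
```

Because each file parses independently, the work is embarrassingly parallel and scales with available cores until disk I/O dominates.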

ANALYSIS METRICS

When you run analysis, you see:

📊 ANALYSIS RESULTS

Files: 156 migrations
Statements: 2,847 total
Size: 12.4 MB

🎯 CONSOLIDATION POTENTIAL: 78%

Opportunities detected:
  ► CREATE + ALTER pairs: 89 instances
  ► Duplicate indexes: 23 found
  ► Duplicate functions: 12 found
  ► RLS policies: 34 can be optimized
  ► Dead code: 8 unused functions

Estimated output:
  ► Files: 34 (78% reduction)
  ► Statements: 1,124 (61% reduction)
  ► Size: ~4.2 MB (66% reduction)
