CAPYSQUASH

PostgreSQL Parser Deep Dive: How CAPYSQUASH Achieves 100% Accuracy

Explore the sophisticated PostgreSQL parsing technology that powers CAPYSQUASH's migration optimization - from AST analysis to dependency resolution

PostgreSQL Parser Deep Dive: How CAPYSQUASH Achieves 100% Accuracy

Discover the sophisticated parsing technology that enables CAPYSQUASH to understand PostgreSQL migrations with the same accuracy as PostgreSQL itself.

Unlike traditional tools that rely on regex patterns and heuristics, CAPYSQUASH uses PostgreSQL's actual parser to achieve perfect SQL understanding. This deep dive explores how we leverage parser-grade accuracy to deliver migration optimization that never breaks your schema.

🧠 Why Traditional Parsing Fails

The Regex Problem

Most migration tools use pattern matching and regular expressions to parse SQL. This approach has fundamental limitations:

Parser-Grade Understanding: CAPYSQUASH doesn't pattern-match - it comprehends. By using PostgreSQL's actual parser, we understand SQL structure, context, and semantics with 100% accuracy.

Common Regex Failures:

-- ☒ Regex can't handle nested structures
CREATE TABLE complex (
    id uuid DEFAULT gen_random_uuid() PRIMARY KEY,
    data jsonb DEFAULT '{"nested": {"deep": {"value": true}}}'::jsonb,
    constraint check_complex CHECK (
        (data->>'type')::text IN ('user', 'admin') AND
        (data->'permissions'->'level')::int >= 1
    )
);

-- ☒ Regex breaks on complex string literals
CREATE TABLE messages (
    content text DEFAULT 'Hello, this contains "quotes" and ''apostrophes'' and \n newlines'
);

-- ☒ Regex can't understand context-dependent keywords
CREATE TABLE user_roles (
    role text CHECK (role IN ('user', 'admin', 'read-only')) -- 'read-only' isn't a keyword
);

The Semantic Understanding Problem

Traditional tools can't understand SQL semantics:

-- Traditional tools see: CREATE TABLE, ALTER TABLE, DROP TABLE
-- CAPYSQUASH sees: Object lifecycle, dependencies, constraints

-- Phase 1: Initial creation
CREATE TABLE users (id uuid, email text);

-- Phase 2: Evolution
ALTER TABLE users ADD COLUMN created_at timestamptz;
ALTER TABLE users ADD COLUMN updated_at timestamptz;
ALTER TABLE users ALTER COLUMN email SET NOT NULL;

-- Phase 3: Optimization
CREATE TABLE users (
    id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    email text NOT NULL,
    created_at timestamptz DEFAULT now(),
    updated_at timestamptz DEFAULT now()
);

🔬 The PostgreSQL Parser Technology

pg_query_go: PostgreSQL's C Parser in Go

CAPYSQUASH leverages pg_query_go, a Go wrapper around PostgreSQL's actual C parser:

// This is what PostgreSQL itself uses to parse SQL
// CAPYSQUASH uses the exact same technology

type RawStmt struct {
   Stmt   Node
   StmtLocation int
   StmtLen      int
}

type CreateStmt struct {
   Relation        *RangeVar
   TableElts       []Node
   InhRelations    []Node
   Partbound       Node
   Opts            []Node
   Oncommit        OnCommitAction
   Options         []Node
}

Why pg_query_go is Superior:

  • 100% PostgreSQL Compatible: Handles any valid PostgreSQL syntax
  • Context-Aware: Understands keywords based on context, not patterns
  • Structure-Preserving: Maintains exact semantic meaning
  • Version-Specific: Handles PostgreSQL version differences
  • Extension-Aware: Understands extension-specific syntax

Abstract Syntax Trees (AST)

Every SQL statement is parsed into a detailed AST:

-- Input SQL
CREATE TABLE users (
    id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    email text NOT NULL UNIQUE,
    created_at timestamptz DEFAULT now()
);

-- AST Structure (simplified)
CreateStmt {
    Relation: RangeVar { relname: "users" },
    TableElts: [
        ColumnDef {
            colname: "id",
            typeName: SystemTypeName { names: ["uuid"] },
            constraints: [
                Constraint { contype: CONSTR_PRIMARY },
                Constraint {
                    contype: CONSTR_DEFAULT,
                    raw_expr: FuncCall { funcname: ["gen_random_uuid"] }
                }
            ]
        },
        ColumnDef {
            colname: "email",
            typeName: SystemTypeName { names: ["text"] },
            constraints: [
                Constraint { contype: CONSTR_NOTNULL },
                Constraint { contype: CONSTR_UNIQUE }
            ]
        },
        ColumnDef {
            colname: "created_at",
            typeName: SystemTypeName { names: ["timestamptz"] },
            constraints: [
                Constraint {
                    contype: CONSTR_DEFAULT,
                    raw_expr: FuncCall { funcname: ["now"] }
                }
            ]
        }
    ]
}

🎯 Advanced Parsing Capabilities

1. Complex Data Type Handling

CAPYSQUASH understands PostgreSQL's rich type system:

-- Arrays with complex elements
CREATE TABLE posts (
    tags text[] DEFAULT ARRAY['postgresql', 'optimization'],
    metadata jsonb DEFAULT '{"views": 0, "likes": []}'::jsonb,
    ranges numrange[] DEFAULT '{[1,10), [20,30)}'
);

-- Custom types and domains
CREATE DOMAIN email_address AS text
    CHECK (VALUE ~* '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$');

CREATE TABLE users (
    contact_info email_address,
    preferences user_preferences_type
);

-- Enum types
CREATE TYPE user_status AS ENUM ('active', 'inactive', 'suspended');
CREATE TABLE users (status user_status DEFAULT 'active');

2. Advanced Constraint Understanding

CAPYSQUASH parses and preserves complex constraints:

-- Multi-column constraints
CREATE TABLE orders (
    user_id uuid,
    product_id uuid,
    quantity int,
    PRIMARY KEY (user_id, product_id),
    CONSTRAINT valid_quantity CHECK (quantity > 0 AND quantity <= 1000),
    CONSTRAINT user_product_unique UNIQUE (user_id, product_id)
);

-- Foreign key constraints with actions
CREATE TABLE order_items (
    order_id uuid REFERENCES orders(id) ON DELETE CASCADE,
    product_id uuid REFERENCES products(id) ON DELETE RESTRICT,
    quantity int NOT NULL CHECK (quantity > 0)
);

-- Complex check constraints
CREATE TABLE employees (
    salary numeric,
    department text,
    CONSTRAINT salary_check CHECK (
        CASE
            WHEN department = 'engineering' THEN salary >= 80000
            WHEN department = 'sales' THEN salary >= 60000
            ELSE salary >= 40000
        END
    )
);

3. Function and Procedure Parsing

CAPYSQUASH understands complex function definitions:

-- PL/pgSQL functions with complex logic
CREATE OR REPLACE FUNCTION calculate_user_metrics(user_id_param uuid)
RETURNS TABLE (
    total_orders bigint,
    total_spent numeric,
    favorite_category text
) AS $$
DECLARE
    user_orders RECORD;
    category_counts RECORD;
BEGIN
    -- Complex query with CTEs and window functions
    WITH user_order_data AS (
        SELECT
            o.id,
            o.total_amount,
            p.category,
            ROW_NUMBER() OVER (PARTITION BY p.category ORDER BY o.created_at DESC) as rn
        FROM orders o
        JOIN products p ON o.product_id = p.id
        WHERE o.user_id = user_id_param
    ),
    category_stats AS (
        SELECT
            category,
            COUNT(*) as order_count,
            SUM(total_amount) as category_total
        FROM user_order_data
        GROUP BY category
    )
    SELECT
        COUNT(*) as total_orders,
        COALESCE(SUM(total_amount), 0) as total_spent,
        (SELECT category FROM category_stats ORDER BY order_count DESC LIMIT 1) as favorite_category
    INTO total_orders, total_spent, favorite_category
    FROM user_order_data;

    RETURN NEXT;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

🔄 Dependency Resolution Through AST Analysis

Building Object Dependency Graphs

CAPYSQUASH analyzes ASTs to build precise dependency graphs:

-- These migrations have complex dependencies
-- Migration 1: Base types
CREATE TYPE user_status AS ENUM ('active', 'inactive');

-- Migration 2: User table with type dependency
CREATE TABLE users (
    id uuid PRIMARY KEY,
    status user_status DEFAULT 'active'
);

-- Migration 3: Function with table dependency
CREATE FUNCTION get_active_users()
RETURNS TABLE (id uuid, status user_status) AS $$
    SELECT id, status FROM users WHERE status = 'active';
$$ LANGUAGE sql;

-- Migration 4: View with function dependency
CREATE VIEW active_user_summary AS
    SELECT COUNT(*) as active_count
    FROM get_active_users();

-- Migration 5: Index with view dependency (PostgreSQL 14+)
CREATE INDEX idx_active_user_count ON users ((SELECT COUNT(*) FROM active_user_summary));

Dependency Graph Built by CAPYSQUASH:

user_status (type)

users (table)

get_active_users (function)

active_user_summary (view)

idx_active_user_count (index)

Circular Dependency Detection

CAPYSQUASH identifies and resolves circular dependencies:

-- Circular dependency example
CREATE TABLE users (
    id uuid PRIMARY KEY,
    manager_id uuid REFERENCES users(id) -- Self-reference
);

CREATE TABLE departments (
    id uuid PRIMARY KEY,
    head_user_id uuid REFERENCES users(id)
);

-- Function that references both tables
CREATE FUNCTION get_department_head(dept_id uuid)
RETURNS uuid AS $$
    SELECT head_user_id FROM departments WHERE id = dept_id;
$$ LANGUAGE sql;

-- CAPYSQUASH detects the cycle and handles it correctly

🛡️ Safety Through Semantic Understanding

Semantic Equivalence Preservation

CAPYSQUASH ensures semantic equivalence through deep understanding:

-- Before: Multiple migrations
-- Migration 1
CREATE TABLE users (id uuid, email text);

-- Migration 2
ALTER TABLE users ADD COLUMN created_at timestamptz DEFAULT now();

-- Migration 3
ALTER TABLE users ADD COLUMN updated_at timestamptz DEFAULT now();

-- Migration 4
ALTER TABLE users ALTER COLUMN email SET NOT NULL;

-- Migration 5
ALTER TABLE users ADD CONSTRAINT users_email_unique UNIQUE (email);

-- After: Semantically equivalent single migration
CREATE TABLE users (
    id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    email text NOT NULL UNIQUE,
    created_at timestamptz DEFAULT now(),
    updated_at timestamptz DEFAULT now()
);

Constraint and Index Preservation

CAPYSQUASH preserves all constraints and indexes with exact semantics:

-- Complex constraints preserved exactly
CREATE TABLE products (
    id uuid PRIMARY KEY,
    name text NOT NULL,
    price numeric CHECK (price > 0),
    discount_price numeric CHECK (discount_price >= 0 AND discount_price < price),
    CONSTRAINT valid_pricing CHECK (
        (discount_price IS NULL) OR (discount_price < price)
    ),
    CONSTRAINT name_length CHECK (length(name) >= 2)
);

-- Indexes with expressions preserved
CREATE INDEX idx_products_search ON products (
    lower(name),
    CASE WHEN discount_price IS NOT NULL THEN discount_price ELSE price END
);

-- Partial indexes preserved
CREATE INDEX idx_active_products ON products (id) WHERE price > 0 AND name IS NOT NULL;

📊 Performance Optimization Through Parser Intelligence

Query Pattern Recognition

CAPYSQUASH identifies query patterns for optimization:

-- Pattern: Repeated ALTER TABLE operations
-- Before:
ALTER TABLE users ADD COLUMN first_name text;
ALTER TABLE users ADD COLUMN last_name text;
ALTER TABLE users ADD COLUMN email text;
ALTER TABLE users ADD COLUMN phone text;

-- After: Optimized single CREATE TABLE
CREATE TABLE users (
    id uuid PRIMARY KEY,
    first_name text,
    last_name text,
    email text,
    phone text
);

Dead Code Elimination

CAPYSQUASH identifies and removes unused objects:

-- Before: Unused objects
CREATE TABLE temp_users (id uuid, email text); -- Never referenced
CREATE INDEX idx_temp_users_email ON temp_users(email); -- Orphaned index
CREATE FUNCTION unused_function() RETURNS int AS $$ SELECT 1; $$ LANGUAGE sql; -- Never called

-- After: Clean, optimized schema
-- Unused objects safely removed

🔧 Advanced Parser Features

1. Extension Syntax Support

CAPYSQUASH understands extension-specific syntax:

-- PostGIS extension
CREATE EXTENSION IF NOT EXISTS postgis;
CREATE TABLE locations (
    id uuid PRIMARY KEY,
    point geometry(Point, 4326),
    area geometry(Polygon, 4326)
);
CREATE INDEX idx_locations_point ON locations USING GIST (point);

-- TimescaleDB extension
CREATE EXTENSION IF NOT EXISTS timescaledb;
CREATE TABLE metrics (
    time timestamptz NOT NULL,
    value double precision,
    device_id text
);
SELECT create_hypertable('metrics', 'time');

-- UUID extension
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE TABLE users (
    id uuid PRIMARY KEY DEFAULT uuid_generate_v4()
);

2. Version-Specific Parsing

CAPYSQUASH handles PostgreSQL version differences:

-- PostgreSQL 12+ generated columns
CREATE TABLE users (
    id uuid PRIMARY KEY,
    first_name text,
    last_name text,
    full_name text GENERATED ALWAYS AS (first_name || ' ' || last_name) STORED
);

-- PostgreSQL 14+ multirange types
CREATE TABLE bookings (
    id uuid PRIMARY KEY,
    stay_dates daterange DEFAULT daterange(current_date, current_date + 7)
);

-- PostgreSQL 15+ security labels
CREATE TABLE sensitive_data (
    id uuid PRIMARY KEY,
    data text,
    SECURITY LABEL FOR 'row_level' ON sensitive_data IS 'classified';
);

3. Complex Expression Parsing

CAPYSQUASH handles complex expressions and subqueries:

-- Complex expressions in defaults
CREATE TABLE orders (
    id uuid PRIMARY KEY,
    total_amount numeric,
    tax_amount numeric GENERATED ALWAYS AS (total_amount * 0.08) STORED,
    final_amount numeric GENERATED ALWAYS AS (total_amount + (total_amount * 0.08)) STORED
);

-- Subqueries in check constraints
CREATE TABLE employees (
    id uuid PRIMARY KEY,
    salary numeric,
    department_id uuid,
    CONSTRAINT salary_within_range CHECK (
        salary BETWEEN (
            SELECT min_salary
            FROM departments
            WHERE id = department_id
        ) AND (
            SELECT max_salary
            FROM departments
            WHERE id = department_id
        )
    )
);

🎯 Real-World Parser Benefits

Case Study: Complex Financial Application

Challenge: 234 migration files with complex financial calculations and constraints.

Traditional Tool Results:

  • 45 parsing errors
  • 12 constraint violations
  • 3 hours manual fixing required

CAPYSQUASH Results:

  • 0 parsing errors
  • 100% constraint preservation
  • 12 files after optimization (95% reduction)
  • 5 minutes total processing time

Case Study: Multi-Tenant SaaS with RLS

Challenge: Row Level Security policies across 50+ tables with complex expressions.

Traditional Tool Results:

  • RLS policies broken or lost
  • Security vulnerabilities introduced
  • Manual policy recreation required

CAPYSQUASH Results:

  • All RLS policies preserved exactly
  • Security maintained 100%
  • Policy dependencies correctly ordered
  • Zero security issues

🔍 Parser vs. Regex: Technical Comparison

FeatureRegex-Based ToolsCAPYSQUASH Parser
Nested StructuresFails on depth > 2Handles any depth
String LiteralsBreaks on quotes/apostrophesPerfect handling
Context KeywordsFalse positives/negativesContext-aware
Complex ConstraintsOften missedFully understood
Extension SyntaxNot supportedFull support
Version DifferencesOne-size-fits-allVersion-specific
Error RecoveryPoorPostgreSQL-grade
Semantic UnderstandingNoneComplete

🚀 The Future of PostgreSQL Parsing

CAPYSQUASH continues to advance PostgreSQL parsing technology:

Upcoming Enhancements

  • PostgreSQL 16 Support: Latest syntax and features
  • Enhanced Error Messages: More detailed parsing feedback
  • Performance Optimizations: Faster parsing for large migration sets
  • Custom Parser Extensions: Support for custom PostgreSQL forks

Research Contributions

CAPYSQUASH contributes to PostgreSQL parsing research:

  • Parser Accuracy Studies: Benchmarking different approaches
  • Migration Pattern Analysis: Understanding common patterns
  • Optimization Algorithms: Advanced consolidation strategies

🎯 Next Steps


Ready to experience 100% parsing accuracy? Install pgsquash-engine or try CAPYSQUASH Platform and see the difference parser-grade technology makes.

How is this guide?

On this page