CAPYSQUASH

SQL Parsing with pg_query_go

How pgsquash-engine uses PostgreSQL's actual parser for accurate SQL analysis

💡 This is the underlying technology that powers CAPYSQUASH Platform, capysquash-cli, and pgsquash-engine.

THE CHALLENGE

PostgreSQL's SQL dialect is notoriously complex:

WHY SQL PARSING IS HARD

  1. Context-Sensitive Grammar - the same syntax means different things in different contexts
  2. Thousands of Keywords - reserved words, type names, and function names that overlap
  3. Complex Type System - arrays, composites, ranges, domains, ENUMs with custom casts
  4. Procedural Languages - PL/pgSQL, PL/Python, PL/v8 embedded in SQL
  5. Extensions - PostGIS, pg_trgm, uuid-ossp add their own syntax
  6. Version Differences - new features in PG 12, 13, 14, 15, 16, 17

THE SOLUTION: PG_QUERY_GO

pgsquash doesn't reimplement PostgreSQL's parser. It uses PostgreSQL's actual parser.

THE ADVANTAGE

pg_query_go is a Go binding to libpg_query, which embeds PostgreSQL's actual C parser.

This means pgsquash-engine parses SQL the same way PostgreSQL does - using the actual parser code from PostgreSQL itself.

What is pg_query_go?

┌─────────────────────────────────────────────────────────────┐
│         CAPYSQUASH Platform + capysquash-cli (Go)          │
├─────────────────────────────────────────────────────────────┤
│                pgsquash-engine (Go library)                 │
├─────────────────────────────────────────────────────────────┤
│                    pg_query_go (Go bindings)                 │
├─────────────────────────────────────────────────────────────┤
│                 libpg_query (C library)                      │
├─────────────────────────────────────────────────────────────┤
│            PostgreSQL Parser (gram.y, parse_*.c)             │
│         The same code PostgreSQL uses internally            │
└─────────────────────────────────────────────────────────────┘

Key Properties:

  • Built from PostgreSQL's actual parser source (maintained by the pganalyze team)
  • Updated with every PostgreSQL release
  • 100% compatibility with PostgreSQL's parsing behavior
  • Parses to Abstract Syntax Trees (ASTs)
  • Zero ambiguity in parsing

PARSING WORKFLOW

Step 1: Load Migration Files

// pgsquash reads your SQL files
files, err := loader.LoadMigrations("migrations/*.sql")
// files = ["001_create_users.sql", "002_add_email.sql", ...]

Step 2: Parse to AST

import "github.com/pganalyze/pg_query_go/v5"

// Parse SQL to Abstract Syntax Tree
tree, err := pg_query.Parse(sqlContent)

// tree contains structured representation:
// - Statement type (CREATE TABLE, ALTER TABLE, etc.)
// - Object names (table name, column names)
// - Data types
// - Constraints
// - Dependencies

Example AST Output

Input SQL:

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) NOT NULL UNIQUE,
    created_at TIMESTAMP DEFAULT NOW()
);

Parsed AST (simplified):

{
  "stmts": [{
    "stmt": {
      "CreateStmt": {
        "relation": {
          "relname": "users",
          "schemaname": "public"
        },
        "tableElts": [
          {
            "ColumnDef": {
              "colname": "id",
              "typeName": { "names": ["pg_catalog", "int4"] },
              "constraints": [{
                "Constraint": {
                  "contype": "CONSTR_PRIMARY"
                }
              }]
            }
          },
          {
            "ColumnDef": {
              "colname": "email",
              "typeName": { "names": ["pg_catalog", "varchar"] },
              "typemod": 259,  // length 255
              "constraints": [
                { "Constraint": { "contype": "CONSTR_NOTNULL" } },
                { "Constraint": { "contype": "CONSTR_UNIQUE" } }
              ]
            }
          }
        ]
      }
    }
  }]
}

Step 3: Extract Metadata

// pgsquash analyzes the AST to extract:
// - Table name: "users"
// - Columns: ["id", "email", "created_at"]
// - Types: [int4, varchar(255), timestamp]
// - Constraints: [PRIMARY KEY, NOT NULL, UNIQUE]
// - Defaults: [nextval('users_id_seq'), none, NOW()]

Step 4: Build Dependency Graph

// Track relationships between objects
dependencies := tracker.BuildDependencyGraph(statements)

// Example relationships:
// - users.id (column) depends on users (table)
// - users_id_seq (sequence) depends on users.id (column)
// - users_email_key (constraint) depends on users.email (column)

Step 5: Identify Consolidation Opportunities

// Find operations that can be safely merged
opportunities := squasher.FindMergeableHooks(dependencies, safetyLevel)

// Example:
// - Migration 001: CREATE TABLE users (id INT);
// - Migration 002: ALTER TABLE users ADD email VARCHAR(255);
// -> Can merge into: CREATE TABLE users (id INT, email VARCHAR(255));

ADVANCED PARSING FEATURES

1. Handling Complex Types

pg_query_go correctly parses:

-- Array types
CREATE TABLE posts (
    tags TEXT[],
    ratings INTEGER[][]
);

-- Composite types
CREATE TYPE address AS (
    street TEXT,
    city TEXT,
    zip INTEGER
);

-- Range types
CREATE TABLE reservations (
    duration TSTZRANGE,
    availability DATERANGE
);

-- Domain types
CREATE DOMAIN email_address AS TEXT
    CHECK (VALUE ~* '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}$');

-- Enum types
CREATE TYPE status AS ENUM ('draft', 'published', 'archived');

Why this matters:

  • Other parsers might treat TEXT[] as a syntax error
  • pgsquash understands array dimensions, range bounds, domain constraints
  • Consolidation preserves all type semantics

2. Understanding Dependencies

Example: Foreign Key Dependencies

-- Migration 001
CREATE TABLE users (id SERIAL PRIMARY KEY);

-- Migration 002
CREATE TABLE posts (
    id SERIAL PRIMARY KEY,
    author_id INTEGER REFERENCES users(id)
);

pg_query_go detects:

  • posts.author_id depends on users.id
  • Cannot consolidate if it violates dependency order
  • Can merge if dependencies are preserved

Result:

-- Consolidated (safe)
CREATE TABLE users (id SERIAL PRIMARY KEY);
CREATE TABLE posts (
    id SERIAL PRIMARY KEY,
    author_id INTEGER REFERENCES users(id)
);
-- Dependencies respected ☑

3. Procedural Language Parsing

pg_query_go handles PL/pgSQL:

CREATE FUNCTION update_modified_timestamp()
RETURNS TRIGGER AS $$
BEGIN
    NEW.modified_at = NOW();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_modified_trigger
    BEFORE UPDATE ON users
    FOR EACH ROW
    EXECUTE FUNCTION update_modified_timestamp();

Parsed correctly:

  • Function body recognized as PL/pgSQL
  • Function arguments and return type extracted
  • Trigger definition and timing parsed
  • Consolidation preserves function logic

4. Extension-Specific Syntax

PostGIS Example:

CREATE EXTENSION postgis;

CREATE TABLE locations (
    id SERIAL PRIMARY KEY,
    name TEXT,
    geom GEOMETRY(Point, 4326)
);

CREATE INDEX idx_locations_geom
    ON locations USING GIST (geom);

pg_query_go understands:

  • GEOMETRY(Point, 4326) is a PostGIS type
  • GIST index is spatial
  • Cannot merge with standard btree indexes

COMPARISON: REGEX VS PG_QUERY_GO

REGEX PARSING

Most migration tools use regex patterns:

/CREATE\s+TABLE\s+(\w+)\s*\(/

Failures:

  • ☒ Breaks on keywords in strings
  • ☒ Misses quoted identifiers
  • ☒ Ignores schema qualifications
  • ☒ Can't handle nested parentheses
  • ☒ Fails on comments in SQL
  • ☒ Doesn't understand types

Accuracy: ~70% (fails on edge cases)

PG_QUERY_GO

Uses PostgreSQL's parser:

tree, err := pg_query.Parse(sql)
if err != nil {
    // handle the parse error instead of discarding it
}
// In pg_query_go v5 the tree is a protobuf; nodes are
// accessed through generated getters.
stmt := tree.Stmts[0].Stmt.GetCreateStmt()
tableName := stmt.Relation.Relname

Handles:

  • ☑ Keywords anywhere
  • ☑ Quoted identifiers ("table name")
  • ☑ Schema qualification (schema.table)
  • ☑ Nested structures
  • ☑ SQL comments (-- and /* */)
  • ☑ All PostgreSQL types

Accuracy: 100% (same as PostgreSQL)

REAL-WORLD EXAMPLES

Example 1: Quoted Identifiers

SQL:

CREATE TABLE "user-activity" (
    "event-type" VARCHAR(50),
    "timestamp" TIMESTAMP
);

Regex parser:

☒ Might think table name is "user"
☒ Confused by hyphens
☒ "timestamp" conflicts with keyword

pg_query_go:

☑ Table name: "user-activity"
☑ Columns: ["event-type", "timestamp"]
☑ Handles quotes correctly

Example 2: Schema Qualification

SQL:

CREATE TABLE auth.users (id SERIAL);
ALTER TABLE auth.users ADD COLUMN email TEXT;

Regex parser:

☒ Might see "auth" and "users" as separate
☒ Consolidation fails (thinks they're different tables)

pg_query_go:

☑ Schema: "auth"
☑ Table: "users"
☑ Correctly merges into one CREATE TABLE

Example 3: Complex Constraints

SQL:

CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    amount DECIMAL CHECK (amount > 0),
    status TEXT CHECK (status IN ('pending', 'shipped', 'delivered')),
    CONSTRAINT valid_ship_date CHECK (
        (status != 'shipped') OR (shipped_at IS NOT NULL)
    )
);

Regex parser:

☒ Can't parse nested CHECK conditions
☒ Misses constraint names
☒ Doesn't understand OR logic

pg_query_go:

☑ Parses all constraints correctly
☑ Understands boolean logic
☑ Preserves constraint names
☑ Consolidates safely

PERFORMANCE CHARACTERISTICS

PARSING SPEED

  • ~1ms per simple statement
  • ~5ms per complex function
  • ~2s for 200 migrations

Parsing is NOT the bottleneck. Docker validation takes longer.

MEMORY USAGE

  • ~50MB for 100 migrations
  • ~200MB for 500 migrations
  • Streaming mode for 1000+ migrations

Use --streaming flag for large migration sets.

WHY THIS IS A TECHNICAL MOAT

COMPETITIVE ADVANTAGE

1. IMPOSSIBLE TO REPLICATE EASILY

Writing a PostgreSQL parser from scratch is months of work. Using pg_query_go is instant but requires Go expertise.

2. ALWAYS UP-TO-DATE

When PostgreSQL 18 adds new syntax, pg_query_go updates automatically. Competitors must rewrite regex patterns.

3. ZERO EDGE CASES

If PostgreSQL can parse it, pgsquash can parse it. No "unsupported syntax" errors.

4. ENABLES ADVANCED FEATURES

Accurate parsing enables dependency resolution, dead code detection, AI analysis, and safe consolidation.

LIMITATIONS

Even pg_query_go has limits:

WHAT PG_QUERY_GO DOESN'T DO:

  • ☒ Doesn't execute SQL (that's what Docker validation is for)
  • ☒ Doesn't understand runtime behavior (function logic is opaque)
  • ☒ Doesn't connect to databases (parsing is static)
  • ☒ Doesn't validate correctness (only parses structure)

This is why pgsquash combines parsing + Docker validation for complete safety.

NEXT STEPS


Want to dive deeper into the code? Check out pgsquash-engine/internal/parser
