SQL Parsing with pg_query_go
How pgsquash-engine uses PostgreSQL's actual parser for accurate SQL analysis
SQL PARSING WITH PG_QUERY_GO
Understanding the technology: pgsquash-engine uses PostgreSQL's actual parser
💡 This is the underlying technology that powers CAPYSQUASH Platform, capysquash-cli, and pgsquash-engine.
THE CHALLENGE
PostgreSQL's SQL dialect is notoriously complex:
WHY SQL PARSING IS HARD
- 1.Context-Sensitive Grammar - The same syntax means different things in different contexts
- 2.Thousands of Keywords - Reserved words, type names, function names that overlap
- 3.Complex Type System - Arrays, composites, ranges, domains, ENUMs with custom casts
- 4.Procedural Languages - PL/pgSQL, PL/Python, PL/v8 embedded in SQL
- 5.Extensions - PostGIS, pg_trgm, uuid-ossp add their own syntax
- 6.Version Differences - New features in PG 12, 13, 14, 15, 16, 17
THE SOLUTION: PG_QUERY_GO
pgsquash doesn't reimplement PostgreSQL's parser. It uses PostgreSQL's actual parser.
THE ADVANTAGE
pg_query_go is a Go binding to libpg_query, which embeds PostgreSQL's actual C parser.
This means pgsquash-engine parses SQL the same way PostgreSQL does - using the actual parser code from PostgreSQL itself.
What is pg_query_go?
┌─────────────────────────────────────────────────────────────┐
│ CAPYSQUASH Platform + capysquash-cli (Go) │
├─────────────────────────────────────────────────────────────┤
│ pgsquash-engine (Go library) │
├─────────────────────────────────────────────────────────────┤
│ pg_query_go (Go bindings) │
├─────────────────────────────────────────────────────────────┤
│ libpg_query (C library) │
├─────────────────────────────────────────────────────────────┤
│ PostgreSQL Parser (gram.y, parse_*.c) │
│ The same code PostgreSQL uses internally │
└─────────────────────────────────────────────────────────────┘Key Properties:
- Maintained by PostgreSQL core team (indirectly)
- Updated with every PostgreSQL release
- 100% compatibility with PostgreSQL behavior
- Parses to Abstract Syntax Trees (AST)
- Zero ambiguity in parsing
PARSING WORKFLOW
Step 1: Load Migration Files
// pgsquash reads your SQL files
files, err := loader.LoadMigrations("migrations/*.sql")
// files = ["001_create_users.sql", "002_add_email.sql", ...]Step 2: Parse to AST
import "github.com/pganalyze/pg_query_go/v5"
// Parse SQL to Abstract Syntax Tree
tree, err := pg_query.Parse(sqlContent)
// tree contains structured representation:
// - Statement type (CREATE TABLE, ALTER TABLE, etc.)
// - Object names (table name, column names)
// - Data types
// - Constraints
// - DependenciesExample AST Output
Input SQL:
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email VARCHAR(255) NOT NULL UNIQUE,
created_at TIMESTAMP DEFAULT NOW()
);Parsed AST (simplified):
{
"stmts": [{
"stmt": {
"CreateStmt": {
"relation": {
"relname": "users",
"schemaname": "public"
},
"tableElts": [
{
"ColumnDef": {
"colname": "id",
"typeName": { "names": ["pg_catalog", "int4"] },
"constraints": [{
"Constraint": {
"contype": "CONSTR_PRIMARY"
}
}]
}
},
{
"ColumnDef": {
"colname": "email",
"typeName": { "names": ["pg_catalog", "varchar"] },
"typemod": 259, // length 255
"constraints": [
{ "Constraint": { "contype": "CONSTR_NOTNULL" } },
{ "Constraint": { "contype": "CONSTR_UNIQUE" } }
]
}
}
]
}
}
}]
}Step 3: Extract Metadata
// pgsquash analyzes the AST to extract:
// - Table name: "users"
// - Columns: ["id", "email", "created_at"]
// - Types: [int4, varchar(255), timestamp]
// - Constraints: [PRIMARY KEY, NOT NULL, UNIQUE]
// - Defaults: [nextval('users_id_seq'), none, NOW()]Step 4: Build Dependency Graph
// Track relationships between objects
dependencies := tracker.BuildDependencyGraph(statements)
// Example relationships:
// - users.id (column) depends on users (table)
// - users_id_seq (sequence) depends on users.id (column)
// - users_email_key (constraint) depends on users.email (column)Step 5: Identify Consolidation Opportunities
// Find operations that can be safely merged
opportunities := squasher.FindMergeablehooks(dependencies, safetyLevel)
// Example:
// - Migration 001: CREATE TABLE users (id INT);
// - Migration 002: ALTER TABLE users ADD email VARCHAR(255);
// -> Can merge into: CREATE TABLE users (id INT, email VARCHAR(255));ADVANCED PARSING FEATURES
1. Handling Complex Types
pg_query_go correctly parses:
-- Array types
CREATE TABLE posts (
tags TEXT[],
ratings INTEGER[][]
);
-- Composite types
CREATE TYPE address AS (
street TEXT,
city TEXT,
zip INTEGER
);
-- Range types
CREATE TABLE reservations (
duration TSTZRANGE,
availability DATERANGE
);
-- Domain types
CREATE DOMAIN email_address AS TEXT
CHECK (VALUE ~* '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}$');
-- Enum types
CREATE TYPE status AS ENUM ('draft', 'published', 'archived');Why this matters:
- Other parsers might treat
TEXT[]as a syntax error - pgsquash understands array dimensions, range bounds, domain constraints
- Consolidation preserves all type semantics
2. Understanding Dependencies
Example: Foreign Key Dependencies
-- Migration 001
CREATE TABLE users (id SERIAL PRIMARY KEY);
-- Migration 002
CREATE TABLE posts (
id SERIAL PRIMARY KEY,
author_id INTEGER REFERENCES users(id)
);pg_query_go detects:
posts.author_iddepends onusers.id- Cannot consolidate if it violates dependency order
- Can merge if dependencies are preserved
Result:
-- Consolidated (safe)
CREATE TABLE users (id SERIAL PRIMARY KEY);
CREATE TABLE posts (
id SERIAL PRIMARY KEY,
author_id INTEGER REFERENCES users(id)
);
-- Dependencies respected ☑3. Procedural Language Parsing
pg_query_go handles PL/pgSQL:
CREATE FUNCTION update_modified_timestamp()
RETURNS TRIGGER AS $$
BEGIN
NEW.modified_at = NOW();
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER users_modified_trigger
BEFORE UPDATE ON users
FOR EACH ROW
EXECUTE FUNCTION update_modified_timestamp();Parsed correctly:
- Function body recognized as PL/pgSQL
- Function arguments and return type extracted
- Trigger definition and timing parsed
- Consolidation preserves function logic
4. Extension-Specific Syntax
PostGIS Example:
CREATE EXTENSION postgis;
CREATE TABLE locations (
id SERIAL PRIMARY KEY,
name TEXT,
geom GEOMETRY(Point, 4326)
);
CREATE INDEX idx_locations_geom
ON locations USING GIST (geom);pg_query_go understands:
GEOMETRY(Point, 4326)is a PostGIS typeGISTindex is spatial- Cannot merge with standard btree indexes
COMPARISON: REGEX VS PG_QUERY_GO
REGEX PARSING
Most migration tools use regex patterns:
/CREATE\s+TABLE\s+(\w+)\s*\(/Failures:
- ☒ Breaks on keywords in strings
- ☒ Misses quoted identifiers
- ☒ Ignores schema qualifications
- ☒ Can't handle nested parentheses
- ☒ Fails on comments in SQL
- ☒ Doesn't understand types
Accuracy: ~70% (fails on edge cases)
PG_QUERY_GO
Uses PostgreSQL's parser:
tree, _ := pg_query.Parse(sql)
stmt := tree.Stmts[0].Stmt.CreateStmt
tableName := stmt.Relation.RelnameHandles:
- ☑ Keywords anywhere
- ☑ Quoted identifiers ("table name")
- ☑ Schema qualification (schema.table)
- ☑ Nested structures
- ☑ SQL comments (-- and /* */)
- ☑ All PostgreSQL types
Accuracy: 100% (same as PostgreSQL)
REAL-WORLD EXAMPLES
Example 1: Quoted Identifiers
SQL:
CREATE TABLE "user-activity" (
"event-type" VARCHAR(50),
"timestamp" TIMESTAMP
);Regex parser:
☒ Might think table name is "user"
☒ Confused by hyphens
☒ "timestamp" conflicts with keywordpg_query_go:
☑ Table name: "user-activity"
☑ Columns: ["event-type", "timestamp"]
☑ Handles quotes correctlyExample 2: Schema Qualification
SQL:
CREATE TABLE auth.users (id SERIAL);
ALTER TABLE auth.users ADD COLUMN email TEXT;Regex parser:
☒ Might see "auth" and "users" as separate
☒ Consolidation fails (thinks they're different tables)pg_query_go:
☑ Schema: "auth"
☑ Table: "users"
☑ Correctly merges into one CREATE TABLEExample 3: Complex Constraints
SQL:
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
amount DECIMAL CHECK (amount > 0),
status TEXT CHECK (status IN ('pending', 'shipped', 'delivered')),
CONSTRAINT valid_ship_date CHECK (
(status != 'shipped') OR (shipped_at IS NOT NULL)
)
);Regex parser:
☒ Can't parse nested CHECK conditions
☒ Misses constraint names
☒ Doesn't understand OR logicpg_query_go:
☑ Parses all constraints correctly
☑ Understands boolean logic
☑ Preserves constraint names
☑ Consolidates safelyPERFORMANCE CHARACTERISTICS
PARSING SPEED
- ~1msper simple statement
- ~5msper complex function
- ~2sfor 200 migrations
Parsing is NOT the bottleneck. Docker validation takes longer.
MEMORY USAGE
- ~50MBfor 100 migrations
- ~200MBfor 500 migrations
- Streamingfor 1000+ migrations
Use --streaming flag for large migration sets.
WHY THIS IS A TECHNICAL MOAT
COMPETITIVE ADVANTAGE
1. IMPOSSIBLE TO REPLICATE EASILY
Writing a PostgreSQL parser from scratch is months of work. Using pg_query_go is instant but requires Go expertise.
2. ALWAYS UP-TO-DATE
When PostgreSQL 18 adds new syntax, pg_query_go updates automatically. Competitors must rewrite regex patterns.
3. ZERO EDGE CASES
If PostgreSQL can parse it, pgsquash can parse it. No "unsupported syntax" errors.
4. ENABLES ADVANCED FEATURES
Accurate parsing enables dependency resolution, dead code detection, AI analysis, and safe consolidation.
LIMITATIONS
Even pg_query_go has limits:
WHAT PG_QUERY_GO DOESN'T DO:
- ☒ Doesn't execute SQL (that's what Docker validation is for)
- ☒ Doesn't understand runtime behavior (function logic is opaque)
- ☒ Doesn't connect to databases (parsing is static)
- ☒ Doesn't validate correctness (only parses structure)
This is why pgsquash combines parsing + Docker validation for complete safety.
NEXT STEPS
- Docker Validation - How execution validation works
- Dependency Resolution - Using AST for safe ordering
- Commands Reference - Using the CLI
Want to dive deeper into the code? Check out pgsquash-engine/internal/parser
How is this guide?