Skip to content

Latest commit

 

History

History
325 lines (254 loc) · 8.54 KB

File metadata and controls

325 lines (254 loc) · 8.54 KB

PHP Parser Architecture & Design Specification (v2.0)

1. Overview

Objective: specific Build a production-grade, fault-tolerant, zero-copy PHP parser in Rust. Target Compliance: PHP 8.x Grammar. Key Architectural Principles:

  1. Strict Lifetime Separation: distinct lifetimes for Source Code ('src) and AST/Arena ('ast).
  2. Pure Arena Allocation: The AST contains no heap allocations (Vec, String, Box). All data lives in the Bump arena.
  3. Resilience: The parser never panics or aborts. It produces Error Nodes and synchronizes to recover context.
  4. Byte-Oriented: Input is processed as &[u8] to handle mixed encodings safely, with Spans representing byte offsets.

2. Core Data Structures

2.1. Spans (Source Mapping)

Spans represent byte offsets. We do not assume UTF-8 validity at the Span level, allowing the parser to handle binary strings or legacy encodings if needed.

#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

impl Span {
    pub fn new(start: usize, end: usize) -> Self { Self { start, end } }
    
    pub fn len(&self) -> usize { self.end - self.start }
    
    /// Safely slice the source. Returns None if indices are out of bounds.
    pub fn as_str<'src>(&self, source: &'src [u8]) -> &'src [u8] {
        &source[self.start..self.end]
    }
}

2.2. Tokens (The Lexeme)

Tokens are lightweight. Complex data (identifiers, literals) are not stored in the token; only their Span is.

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Token {
    pub kind: TokenKind,
    pub span: Span,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TokenKind {
    // Keywords (Hard)
    Function, Class, If, Return,
    // Keywords (Soft/Contextual - e.g. 'readonly', 'match')
    Identifier, 
    // Literals
    LNumber, // Integer
    DNumber, // Float
    StringLiteral, // '...' or "..."
    Variable,      // $var
    // Symbols
    Arrow, // ->
    Plus,
    OpenTag, // <?php
    Eof,
}

3. The AST (Abstract Syntax Tree)

3.1. Memory Layout Strategy

To ensure high cache locality and zero heap fragmentation:

  1. No Box<T>: Use references &'ast T.
  2. No Vec<T>: Use slice references &'ast [T].
  3. Handle Types: Use type aliases to make signatures readable.
use bumpalo::Bump;

/// Lifetime 'ast: The duration the Arena exists.
pub type ExprId<'ast> = &'ast Expr<'ast>;
pub type StmtId<'ast> = &'ast Stmt<'ast>;

3.2. AST Definitions

All AST nodes include a span covering the entire construct.

#[derive(Debug)]
pub struct Program<'ast> {
    pub statements: &'ast [StmtId<'ast>],
    pub span: Span,
}

#[derive(Debug)]
pub enum Stmt<'ast> {
    Echo {
        exprs: &'ast [ExprId<'ast>], // Arena-backed slice
        span: Span,
    },
    Function {
        name: &'ast Token, // Reference to the identifier token
        params: &'ast [Param<'ast>],
        body: &'ast [StmtId<'ast>],
        span: Span,
    },
    /// Represents a parsing failure at the Statement level
    Error {
        span: Span,
    },
    // ...
}

#[derive(Debug)]
pub enum Expr<'ast> {
    Binary {
        left: ExprId<'ast>,
        op: BinaryOp,
        right: ExprId<'ast>,
        span: Span,
    },
    Variable {
        name: Span,
        span: Span,
    },
    /// Represents a parsing failure at the Expression level
    Error {
        span: Span,
    },
    // ...
}

4. The Lexer (Context & State)

The Lexer is a state machine that operates on &[u8]. It accepts hints from the Parser to handle "Soft Keywords" (e.g., treating match as an identifier when following ->).

4.1. Lexer Modes

The parser controls the lexer's sensitivity to keywords.

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LexerMode {
    Standard,           // Normal PHP parsing
    LookingForProperty, // After '->' or '::'. Keywords become identifiers.
    LookingForVarName,  // After '$'. 
}

4.2. Lexer Implementation

pub struct Lexer<'src> {
    input: &'src [u8],
    cursor: usize,
    /// Stack for interpolation (Scripting, DoubleQuote, Heredoc)
    state_stack: Vec<LexerState>, 
    /// Current internal state (e.g. InScripting)
    internal_state: LexerState,
    /// Mode hint from Parser
    mode: LexerMode,
}

impl<'src> Lexer<'src> {
    pub fn new(input: &'src [u8]) -> Self { /* ... */ }

    /// Called by the Parser/TokenSource to change context
    pub fn set_mode(&mut self, mode: LexerMode) {
        self.mode = mode;
    }
}

impl<'src> Iterator for Lexer<'src> {
    type Item = Token;
    fn next(&mut self) -> Option<Self::Item> {
        // Logic combining internal_state + self.mode
    }
}

5. The Parser (Recursive Descent + Pratt)

The parser orchestrates the Lexer, the Arena, and Error Recovery.

5.1. Token Source Abstraction

Decouples the parser from the raw lexer, enabling lookahead (LL(k)).

pub trait TokenSource<'src> {
    fn current(&self) -> &Token;
    fn lookahead(&self, n: usize) -> &Token;
    fn bump(&mut self);
    fn set_mode(&mut self, mode: LexerMode);
}

5.2. Parser Struct

Separates input lifetime ('src) from output lifetime ('ast).

pub struct Parser<'src, 'ast, T: TokenSource<'src>> {
    tokens: T,
    arena: &'ast Bump,
    errors: Vec<ParseError>,
    /// Marker to use 'src
    _marker: std::marker::PhantomData<&'src ()>, 
}

5.3. Error Recovery Strategy

The parser uses Error Nodes and Synchronization.

  1. Expected Token Missing: Record error, insert synthetic node/token if trivial, or return Expr::Error.
  2. Unexpected Token: Record error, advance tokens until a "Synchronization Point" (;, }, )).
impl<'src, 'ast, T: TokenSource<'src>> Parser<'src, 'ast, T> {
    
    /// Main entry point for expressions
    fn parse_expr(&mut self, min_bp: u8) -> ExprId<'ast> {
        // Check binding power, recurse...
        // If syntax is invalid, do NOT panic.
        // self.errors.push(...);
        // return self.arena.alloc(Expr::Error { span })
    }

    /// Synchronization helper
    fn sync_to_stmt_boundary(&mut self) {
        while self.tokens.current().kind != TokenKind::Eof {
            match self.tokens.current().kind {
                TokenKind::SemiColon | TokenKind::CloseBrace => {
                    self.tokens.bump();
                    return;
                }
                _ => self.tokens.bump(),
            }
        }
    }
}

6. Public API

This defines the library boundary.

pub struct ParseResult<'ast> {
    pub program: Program<'ast>,
    pub errors: Vec<ParseError>,
}

/// The main entry point.
/// 
/// - `source`: Raw bytes of the PHP file.
/// - `arena`: The Bump arena where AST nodes will be allocated.
pub fn parse<'src, 'ast>(
    source: &'src [u8], 
    arena: &'ast Bump
) -> ParseResult<'ast> {
    let lexer = Lexer::new(source);
    let mut parser = Parser::new(lexer, arena);
    parser.parse_program()
}

7. Development Phases

Phase 1: Infrastructure & Basics

  1. Lexer MVP: Implement TokenSource, Lexer with basic states (Initial, Scripting).
  2. Arena Setup: Integrate bumpalo.
  3. AST Skeleton: Define basic Stmt and Expr structs.
  4. Test Harness: Setup insta for snapshot testing.

Phase 2: Expression Engine (Pratt)

  1. Precedence Table: Map PHP precedence to Binding Powers.
  2. Operators: Implement Binary, Unary, Ternary, and instanceof.
  3. Error Nodes: Ensure malformed math (e.g., 1 + * 2) produces Expr::Error.

Phase 3: Statements & Control Flow

  1. Block Parsing: Handle { ... } and scopes.
  2. Control Structures: if, while, return.
  3. Synchronization: Implement sync_to_stmt_boundary to recover from missing semicolons.

Phase 4: Advanced Lexing

  1. Interpolation Stack: DoubleQuotes, Heredoc, Backticks.
  2. Complex Identifiers: Support LexerMode::LookingForProperty for $obj->class.

8. Testing Strategy

  1. Unit Tests: For individual Lexer state transitions.
  2. Snapshot Tests (Insta):
    • Input: test.php
    • Output: Textual representation of the AST (Debug fmt).
    • Purpose: Catch regressions in tree structure.
  3. Recovery Tests:
    • Input: <?php echo 1 + ; echo "done";
    • Assert: program.statements[0] is Echo(Expr::Error).
    • Assert: program.statements[1] is Echo("done").
    • The parser must not stop at the first semicolon error.
  4. Corpus Tests: Parse large open-source PHP projects (WordPress, Laravel) to ensure no panics on valid code.