Objective: Build a production-grade, fault-tolerant, zero-copy PHP parser in Rust.

Target Compliance: PHP 8.x Grammar.

Key Architectural Principles:
- Strict Lifetime Separation: Distinct lifetimes for Source Code (`'src`) and AST/Arena (`'ast`).
- Pure Arena Allocation: The AST contains no heap allocations (`Vec`, `String`, `Box`). All data lives in the `Bump` arena.
- Resilience: The parser never panics or aborts. It produces Error Nodes and synchronizes to recover context.
- Byte-Oriented: Input is processed as `&[u8]` to handle mixed encodings safely, with Spans representing byte offsets.

Spans represent byte offsets. We do not assume UTF-8 validity at the Span level, allowing the parser to handle binary strings or legacy encodings if needed.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

impl Span {
    pub fn new(start: usize, end: usize) -> Self { Self { start, end } }

    /// Length in bytes. Saturates to 0 for an inverted span rather than underflowing.
    pub fn len(&self) -> usize { self.end.saturating_sub(self.start) }

    /// Safely slice the source. Returns None if the span is inverted or out of bounds.
    /// Despite the name, this yields raw bytes; no UTF-8 validity is assumed.
    pub fn as_str<'src>(&self, source: &'src [u8]) -> Option<&'src [u8]> {
        source.get(self.start..self.end)
    }
}

Tokens are lightweight. Complex data (identifiers, literals) is not stored in the token; only its Span is.
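
As a rough illustration of the span-only approach, the sketch below recovers a variable's text by slicing the source bytes on demand. `SketchSpan` and `text_of` are hypothetical names used only here so the example compiles on its own; they are not part of the design.

```rust
// Local stand-in for a byte-offset span (hypothetical, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SketchSpan {
    start: usize,
    end: usize,
}

/// Hypothetical helper: returns the bytes covered by `span`, or None when the
/// span is inverted or reaches past the end of `source`.
fn text_of<'src>(span: SketchSpan, source: &'src [u8]) -> Option<&'src [u8]> {
    // `get` performs the bounds check, so a bad span never panics.
    source.get(span.start..span.end)
}
```

For the input `b"<?php $caf\xE9 = 1;"`, a span of `6..11` yields the five bytes `$caf\xE9`. The last byte is Latin-1 and not valid UTF-8, which is why the slice stays `&[u8]` rather than `&str`.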
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Token {
    pub kind: TokenKind,
    pub span: Span,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenKind {
    // Keywords (Hard)
    Function, Class, If, Return,
    // Keywords (Soft/Contextual - e.g. 'readonly', 'match')
    Identifier,
    // Literals
    LNumber,       // Integer
    DNumber,       // Float
    StringLiteral, // '...' or "..."
    Variable,      // $var
    // Symbols
    Arrow,      // ->
    Plus,
    SemiColon,  // ; (used by sync_to_stmt_boundary)
    CloseBrace, // } (used by sync_to_stmt_boundary)
    OpenTag,    // <?php
    Eof,
}

To ensure high cache locality and zero heap fragmentation:
- No `Box<T>`: Use references `&'ast T`.
- No `Vec<T>`: Use slice references `&'ast [T]`.
- Handle Types: Use type aliases to make signatures readable.

use bumpalo::Bump;
/// Lifetime 'ast: The duration the Arena exists.
pub type ExprId<'ast> = &'ast Expr<'ast>;
pub type StmtId<'ast> = &'ast Stmt<'ast>;

All AST nodes include a span covering the entire construct.

#[derive(Debug)]
pub struct Program<'ast> {
    pub statements: &'ast [StmtId<'ast>],
    pub span: Span,
}

#[derive(Debug)]
pub enum Stmt<'ast> {
    Echo {
        exprs: &'ast [ExprId<'ast>], // Arena-backed slice
        span: Span,
    },
    Function {
        name: &'ast Token, // Reference to the identifier token
        params: &'ast [Param<'ast>],
        body: &'ast [StmtId<'ast>],
        span: Span,
    },
    /// Represents a parsing failure at the Statement level
    Error {
        span: Span,
    },
    // ...
}

#[derive(Debug)]
pub enum Expr<'ast> {
    Binary {
        left: ExprId<'ast>,
        op: BinaryOp,
        right: ExprId<'ast>,
        span: Span,
    },
    Variable {
        name: Span,
        span: Span,
    },
    /// Represents a parsing failure at the Expression level
    Error {
        span: Span,
    },
    // ...
}

The Lexer is a state machine that operates on &[u8]. It accepts hints from the Parser to handle "Soft Keywords" (e.g., treating match as an identifier when following ->).

The parser controls the lexer's sensitivity to keywords.

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LexerMode {
    Standard,           // Normal PHP parsing
    LookingForProperty, // After '->' or '::'. Keywords become identifiers.
    LookingForVarName,  // After '$'.
}

pub struct Lexer<'src> {
    input: &'src [u8],
    cursor: usize,
    /// Stack for interpolation (Scripting, DoubleQuote, Heredoc)
    state_stack: Vec<LexerState>,
    /// Current internal state (e.g. InScripting)
    internal_state: LexerState,
    /// Mode hint from Parser
    mode: LexerMode,
}

impl<'src> Lexer<'src> {
    pub fn new(input: &'src [u8]) -> Self { /* ... */ }

    /// Called by the Parser/TokenSource to change context
    pub fn set_mode(&mut self, mode: LexerMode) {
        self.mode = mode;
    }
}

impl<'src> Iterator for Lexer<'src> {
    type Item = Token;

    fn next(&mut self) -> Option<Self::Item> {
        // Logic combining internal_state + self.mode
    }
}

The parser orchestrates the Lexer, the Arena, and Error Recovery.

The TokenSource trait decouples the parser from the raw lexer, enabling lookahead (LL(k)).

pub trait TokenSource<'src> {
    fn current(&self) -> &Token;
    fn lookahead(&self, n: usize) -> &Token;
    fn bump(&mut self);
    fn set_mode(&mut self, mode: LexerMode);
}

The Parser struct separates the input lifetime ('src) from the output lifetime ('ast).

pub struct Parser<'src, 'ast, T: TokenSource<'src>> {
    tokens: T,
    arena: &'ast Bump,
    errors: Vec<ParseError>,
    /// Marker to use 'src
    _marker: std::marker::PhantomData<&'src ()>,
}

The parser uses Error Nodes and Synchronization.
- Expected Token Missing: Record error, insert synthetic node/token if trivial, or return `Expr::Error`.
- Unexpected Token: Record error, advance tokens until a "Synchronization Point" (`;`, `}`, `)`).

impl<'src, 'ast, T: TokenSource<'src>> Parser<'src, 'ast, T> {
    /// Main entry point for expressions
    fn parse_expr(&mut self, min_bp: u8) -> ExprId<'ast> {
        // Check binding power, recurse...
        // If syntax is invalid, do NOT panic.
        // self.errors.push(...);
        // return self.arena.alloc(Expr::Error { span })
    }

    /// Synchronization helper: skips tokens up to and including the next `;` or `}`.
    fn sync_to_stmt_boundary(&mut self) {
        while self.tokens.current().kind != TokenKind::Eof {
            match self.tokens.current().kind {
                TokenKind::SemiColon | TokenKind::CloseBrace => {
                    self.tokens.bump();
                    return;
                }
                _ => self.tokens.bump(),
            }
        }
    }
}

This defines the library boundary.

pub struct ParseResult<'ast> {
    pub program: Program<'ast>,
    pub errors: Vec<ParseError>,
}

/// The main entry point.
///
/// - `source`: Raw bytes of the PHP file.
/// - `arena`: The Bump arena where AST nodes will be allocated.
pub fn parse<'src, 'ast>(
    source: &'src [u8],
    arena: &'ast Bump,
) -> ParseResult<'ast> {
    let lexer = Lexer::new(source);
    let mut parser = Parser::new(lexer, arena);
    parser.parse_program()
}

- Lexer MVP: Implement `TokenSource`, `Lexer` with basic states (`Initial`, `Scripting`).
- Arena Setup: Integrate `bumpalo`.
- AST Skeleton: Define basic `Stmt` and `Expr` structs.
- Test Harness: Set up `insta` for snapshot testing.
- Precedence Table: Map PHP precedence to Binding Powers.
- Operators: Implement Binary, Unary, Ternary, and `instanceof`.
- Error Nodes: Ensure malformed math (e.g., `1 + * 2`) produces `Expr::Error`.
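
The binding-power and error-node items above can be sketched with a small Pratt loop. This is a minimal illustration under stated assumptions, not the parser's actual implementation: `Tok`, `Node`, `infix_bp`, and `PrattSketch` are hypothetical names, the binding-power numbers are arbitrary, and `Box` stands in for arena references so the example needs only the standard library.

```rust
// Hypothetical token and node types for this sketch only. The design above
// would allocate nodes in the arena; Box keeps the example std-only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tok { Num(i64), Plus, Star, Eof }

#[derive(Debug, PartialEq)]
enum Node {
    Num(i64),
    Binary(Box<Node>, Tok, Box<Node>),
    Error,
}

/// Left and right binding powers for infix operators. A higher number binds
/// tighter; right > left makes the operator left-associative.
fn infix_bp(tok: Tok) -> Option<(u8, u8)> {
    match tok {
        Tok::Plus => Some((1, 2)),
        Tok::Star => Some((3, 4)),
        _ => None,
    }
}

struct PrattSketch<'a> {
    toks: &'a [Tok],
    pos: usize,
    errors: Vec<String>,
}

impl<'a> PrattSketch<'a> {
    fn new(toks: &'a [Tok]) -> Self {
        Self { toks, pos: 0, errors: Vec::new() }
    }

    fn peek(&self) -> Tok {
        self.toks.get(self.pos).copied().unwrap_or(Tok::Eof)
    }

    fn parse_expr(&mut self, min_bp: u8) -> Node {
        let mut lhs = match self.peek() {
            Tok::Num(n) => { self.pos += 1; Node::Num(n) }
            other => {
                // Missing operand: record it and leave the token in place so
                // the operator loop below can still consume it.
                self.errors.push(format!("expected operand, found {:?}", other));
                Node::Error
            }
        };
        loop {
            let op = self.peek();
            let (l_bp, r_bp) = match infix_bp(op) {
                Some(bp) => bp,
                None => break,
            };
            if l_bp < min_bp { break; }
            self.pos += 1; // consume the operator
            let rhs = self.parse_expr(r_bp);
            lhs = Node::Binary(Box::new(lhs), op, Box::new(rhs));
        }
        lhs
    }
}
```

In this sketch, the tokens for `1 + * 2` parse to `1 + (Error * 2)` with one recorded error: the error node fills the slot of the missing operand and the rest of the expression is still parsed. Whether the real parser keeps the partial tree or collapses it into a single `Expr::Error` is a design choice the outline leaves open.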
- Block Parsing: Handle `{ ... }` and scopes.
- Control Structures: `if`, `while`, `return`.
- Synchronization: Implement `sync_to_stmt_boundary` to recover from missing semicolons.
- Interpolation Stack: `DoubleQuotes`, `Heredoc`, `Backticks`.
- Complex Identifiers: Support `LexerMode::LookingForProperty` for `$obj->class`.
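
To make the `$obj->class` case concrete, here is a minimal sketch of a mode-sensitive lexer. It assumes a much smaller token set than the real one; `MiniLexer`, `next_kind`, `take_word`, and `lex_with_hints` are hypothetical names, and `lex_with_hints` plays the role of the parser by switching the mode right after `->`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LexerMode { Standard, LookingForProperty }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind { Variable, Arrow, Class, Identifier, Unknown, Eof }

// Hypothetical, heavily reduced lexer: no whitespace handling, no states.
struct MiniLexer<'src> {
    input: &'src [u8],
    cursor: usize,
    mode: LexerMode,
}

impl<'src> MiniLexer<'src> {
    fn new(input: &'src [u8]) -> Self {
        Self { input, cursor: 0, mode: LexerMode::Standard }
    }

    fn set_mode(&mut self, mode: LexerMode) {
        self.mode = mode;
    }

    /// Consumes a run of ASCII identifier bytes and returns it.
    fn take_word(&mut self) -> &'src [u8] {
        let input = self.input;
        let start = self.cursor;
        while self.cursor < input.len()
            && (input[self.cursor].is_ascii_alphanumeric() || input[self.cursor] == b'_')
        {
            self.cursor += 1;
        }
        &input[start..self.cursor]
    }

    fn next_kind(&mut self) -> TokenKind {
        let input = self.input;
        let rest = &input[self.cursor..];
        if rest.is_empty() {
            return TokenKind::Eof;
        }
        if rest.starts_with(b"->") {
            self.cursor += 2;
            return TokenKind::Arrow;
        }
        if rest[0] == b'$' {
            self.cursor += 1;
            self.take_word();
            return TokenKind::Variable;
        }
        let word = self.take_word();
        if word.is_empty() {
            self.cursor += 1; // skip a byte this sketch does not understand
            return TokenKind::Unknown;
        }
        match self.mode {
            // After `->`, every word is a member name, even a hard keyword.
            LexerMode::LookingForProperty => TokenKind::Identifier,
            // PHP keywords are case-insensitive.
            LexerMode::Standard if word.eq_ignore_ascii_case(b"class") => TokenKind::Class,
            LexerMode::Standard => TokenKind::Identifier,
        }
    }
}

/// Stand-in for the parser: sets the mode hint after each token.
fn lex_with_hints(src: &[u8]) -> Vec<TokenKind> {
    let mut lexer = MiniLexer::new(src);
    let mut out = Vec::new();
    loop {
        let kind = lexer.next_kind();
        out.push(kind);
        match kind {
            TokenKind::Eof => return out,
            TokenKind::Arrow => lexer.set_mode(LexerMode::LookingForProperty),
            _ => lexer.set_mode(LexerMode::Standard),
        }
    }
}
```

With this sketch, `class` on its own lexes as the `Class` keyword, while the same word after `$obj->` comes back as `Identifier`. Keeping the decision in the lexer, driven by a hint, means the parser never has to re-interpret keyword tokens as names.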
- Unit Tests: For individual Lexer state transitions.
- Snapshot Tests (Insta):
  - Input: `test.php`
  - Output: Textual representation of the AST (Debug fmt).
  - Purpose: Catch regressions in tree structure.
- Recovery Tests:
  - Input: `<?php echo 1 + ; echo "done";`
  - Assert: `program.statements[0]` is `Echo(Expr::Error)`.
  - Assert: `program.statements[1]` is `Echo("done")`.
  - The parser must not stop at the first semicolon error.
- Corpus Tests: Parse large open-source PHP projects (WordPress, Laravel) to ensure no panics on valid code.
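
The recovery test above can be prototyped before the full parser exists. The sketch below is an assumption-heavy stand-in: it works on a pre-lexed slice of token kinds (the `<?php` open tag is omitted), uses `Vec` and `Box` instead of the arena, and every name in it (`Kind`, `MiniExpr`, `MiniStmt`, `MiniParser`) is hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind { Echo, Number, Str, Plus, SemiColon, CloseBrace, Eof }

#[derive(Debug, PartialEq, Eq)]
enum MiniExpr {
    Literal(Kind),
    Sum(Box<MiniExpr>, Box<MiniExpr>),
    Error,
}

#[derive(Debug, PartialEq, Eq)]
enum MiniStmt { Echo(MiniExpr), Error }

struct MiniParser<'a> {
    kinds: &'a [Kind],
    pos: usize,
    errors: Vec<String>,
}

impl<'a> MiniParser<'a> {
    fn new(kinds: &'a [Kind]) -> Self {
        Self { kinds, pos: 0, errors: Vec::new() }
    }

    fn current(&self) -> Kind {
        self.kinds.get(self.pos).copied().unwrap_or(Kind::Eof)
    }

    fn bump(&mut self) {
        if self.pos < self.kinds.len() { self.pos += 1; }
    }

    /// Skips tokens up to and including the next `;` or `}`.
    fn sync_to_stmt_boundary(&mut self) {
        while self.current() != Kind::Eof {
            let kind = self.current();
            self.bump();
            if matches!(kind, Kind::SemiColon | Kind::CloseBrace) { return; }
        }
    }

    fn parse_operand(&mut self) -> Option<MiniExpr> {
        match self.current() {
            k @ (Kind::Number | Kind::Str) => { self.bump(); Some(MiniExpr::Literal(k)) }
            _ => None,
        }
    }

    /// Records an error and returns MiniExpr::Error instead of panicking.
    fn parse_expr(&mut self) -> MiniExpr {
        let mut lhs = match self.parse_operand() {
            Some(e) => e,
            None => {
                self.errors.push("expected expression".to_string());
                return MiniExpr::Error;
            }
        };
        while self.current() == Kind::Plus {
            self.bump();
            match self.parse_operand() {
                Some(rhs) => lhs = MiniExpr::Sum(Box::new(lhs), Box::new(rhs)),
                None => {
                    self.errors.push("expected operand after '+'".to_string());
                    return MiniExpr::Error;
                }
            }
        }
        lhs
    }

    fn parse_stmt(&mut self) -> MiniStmt {
        if self.current() != Kind::Echo {
            self.errors.push("expected statement".to_string());
            self.sync_to_stmt_boundary();
            return MiniStmt::Error;
        }
        self.bump();
        let expr = self.parse_expr();
        if self.current() == Kind::SemiColon {
            self.bump();
        } else {
            self.errors.push("expected ';'".to_string());
            self.sync_to_stmt_boundary();
        }
        MiniStmt::Echo(expr)
    }

    fn parse_program(&mut self) -> Vec<MiniStmt> {
        let mut out = Vec::new();
        while self.current() != Kind::Eof {
            out.push(self.parse_stmt());
        }
        out
    }
}
```

Feeding this sketch the kinds for `echo 1 + ; echo "done";` gives two statements, `Echo(Error)` followed by `Echo(Literal(Str))`, and a single recorded error, which mirrors the two assertions listed for the recovery test. Every path through `parse_stmt` consumes at least one token, so the loop in `parse_program` cannot stall on malformed input.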