Skip to content

Add CTE strategy for eager loading HasMany/BelongsToMany associations#19318

Closed
dereuromark wants to merge 3 commits into5.nextfrom
feature/cte-eager-loading-strategy
Closed

Add CTE strategy for eager loading HasMany/BelongsToMany associations#19318
dereuromark wants to merge 3 commits into5.nextfrom
feature/cte-eager-loading-strategy

Conversation

@dereuromark
Copy link
Copy Markdown
Member

@dereuromark dereuromark commented Mar 7, 2026

Summary

This PR adds a new 'cte' strategy for eager loading HasMany and BelongsToMany associations that uses Common Table Expressions instead of WHERE IN clauses.

Refs #19107

Problem

When eager loading with large result sets, the current select strategy generates queries like:

SELECT * FROM articles WHERE author_id IN (1, 2, 3, ... 50000);

This has several issues:

  • Packet size limits: MySQL's max_allowed_packet (default 64MB) can be exceeded
  • Parsing overhead: Database must parse thousands of literal values
  • Memory round-trip: IDs fetched to PHP → converted to string → sent back to DB → parsed again

Solution

The new CTE strategy generates:

WITH _cte_Authors AS (
    SELECT id FROM authors WHERE active = 1
)
SELECT articles.* 
FROM articles
INNER JOIN _cte_Authors ON articles.author_id = _cte_Authors.id;

Performance Comparison (MySQL 8.0)

Benchmark comparing all three eager loading strategies with varying parent record counts:

Parent Records Strategy Time (ms) Memory (MB)
1,000 select 21.88 3.33
subquery 14.79 0.00
cte 17.20 0.00
5,000 select 114.48 13.33
subquery 105.10 0.00
cte 111.57 0.67
10,000 select 277.05 16.67
subquery 234.85 0.00
cte 233.17 0.67
25,000 select 745.30 49.33
subquery 591.48 0.00
cte 627.62 0.67

Key findings:

  • Both subquery and cte are faster than select for large datasets
  • select uses significantly more PHP memory (holds all IDs in array)
  • At 10K+ records, subquery and cte perform similarly
  • CTE advantage: avoids packet size limits that could hit both select and subquery with very large ID sets

Usage

// In Table class
$this->hasMany('Articles', ['strategy' => 'cte']);

// Or at query time
$authors = $this->Authors->find()
    ->contain(['Articles' => ['strategy' => 'cte']])
    ->where(['Authors.active' => true])
    ->all();

When to Use

  • Large result sets (1,000+ parent records)
  • Complex parent query conditions
  • Memory-constrained PHP environments
  • High-frequency queries
  • Scenarios where max_allowed_packet limits are a concern

Compatibility

CTEs are supported in MySQL 8.0+, PostgreSQL, SQLite 3.8.3+, and SQL Server. CakePHP 5.x already requires modern DB versions with CTE support.

This adds a new 'cte' strategy for eager loading that uses Common Table
Expressions instead of WHERE IN clauses. This is beneficial for large
result sets as it:

- Avoids packet size limits from large WHERE IN clauses
- Reduces PHP memory usage by keeping IDs in the database
- Allows the database to optimize the join more effectively

Usage:
```php
$this->hasMany('Articles', ['strategy' => 'cte']);
// Or at query time
$query->contain(['Articles' => ['strategy' => 'cte']]);
```

Refs #19107
@dereuromark dereuromark added this to the 5.4.0 milestone Mar 7, 2026
@ADmad
Copy link
Copy Markdown
Member

ADmad commented Mar 7, 2026

This should target 5.next.

@dereuromark dereuromark changed the base branch from 5.x to 5.next March 7, 2026 19:46
@ADmad ADmad requested review from Copilot and markstory March 8, 2026 05:50
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new eager-loading strategy ('cte') for HasMany and BelongsToMany associations that uses a Common Table Expression + join instead of WHERE IN (...), aiming to improve performance/memory usage for large parent result sets.

Changes:

  • Introduces Association::STRATEGY_CTE and enables it for HasMany/BelongsToMany.
  • Implements CTE-based filtering in SelectLoader via with() + innerJoin() against the CTE.
  • Adds HasMany test coverage for the new strategy (guarded by driver CTE support).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/ORM/Association.php Adds the public STRATEGY_CTE constant.
src/ORM/Association/HasMany.php Allows the cte strategy for hasMany eager loading.
src/ORM/Association/BelongsToMany.php Allows the cte strategy for belongsToMany eager loading.
src/ORM/Association/Loader/SelectLoader.php Implements CTE-based filtering/join for eager loading.
tests/TestCase/ORM/Association/HasManyTest.php Adds tests covering HasMany eager loading using cte.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 193 to 199
if ($useSubquery) {
$filter = $this->_buildSubquery($selectQuery);
$fetchQuery = $this->_addFilteringJoin($fetchQuery, $key, $filter);
} elseif ($options['strategy'] === Association::STRATEGY_CTE) {
$cteQuery = $this->_buildSubquery($selectQuery);
$fetchQuery = $this->_addFilteringCTE($fetchQuery, $key, $cteQuery);
} else {
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new cte eager-loading branch doesn't validate that the current driver supports CTEs. If users set strategy => 'cte' on an unsupported DB/version, the generated SQL will include a WITH clause and fail at runtime with a database error. Consider failing fast (eg. throw DatabaseException with a clear message) or transparently falling back to subquery/select when !$fetchQuery->getDriver()->supports(DriverFeatureEnum::CTE).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an opt-in configuration - developers explicitly choose this strategy. A reasonable counterargument is that developers should know their database capabilities.

Comment thread src/ORM/Association/BelongsToMany.php
@ADmad
Copy link
Copy Markdown
Member

ADmad commented Mar 8, 2026

@dereuromark Could add performance comparison when using the subquery strategy too?

Addresses review feedback to add test coverage for the CTE eager loading
strategy with BelongsToMany associations, including:

- Basic CTE strategy test with junction table
- Multiple parent records test
- Limit on parent query test
- Conditions on parent query test
- Junction table conditions test (using SpecialTags)

Also updates testRequiresKeys() to verify CTE strategy does not require keys.
@ADmad
Copy link
Copy Markdown
Member

ADmad commented Mar 8, 2026

Also makes me wonder why we have select as default instead of subquery. Which modern db does not support it?

@dereuromark
Copy link
Copy Markdown
Member Author

dereuromark commented Mar 8, 2026

@ADmad Updated the PR description with a full benchmark comparing all three strategies.

Regarding why select is the default instead of subquery:

Historical reasons:

  1. SQLite limitations: Older SQLite versions had quirks with correlated subqueries
  2. Simplicity: WHERE IN (1,2,3) is the most straightforward query to debug
  3. Small dataset assumption: Most typical use cases fetch <100 parent records where the difference is negligible

Modern reality:

  • MySQL 5.6+ has semi-join optimizations for subqueries
  • MySQL 8.0+, PostgreSQL, modern SQLite all handle subqueries efficiently
  • CakePHP 5.x requires modern DB versions anyway

The benchmark shows subquery is actually slightly faster than both select and cte for large datasets. Changing the default to subquery could be considered for 6.x (or 5.next?), though it would be a behavioral change worth discussing separately.

@ADmad
Copy link
Copy Markdown
Member

ADmad commented Mar 8, 2026

The benchmark shows subquery is actually slightly faster than both select and cte for large datasets.

In that case is adding the cte strategy necessary, if there's already a better alternative available? The issue poster has not clarified whether they tried using subquery strategy.

Changing the default to subquery could be considered for 6.x (or 5.next?), though it would be a behavioral change worth discussing separately.

I would vote to change the default to subquery in 5.next. The chances of it affecting someone use older database is minuscule.

@dereuromark
Copy link
Copy Markdown
Member Author

Yes, still needed, see above
Specifically: Scenarios where max_allowed_packet limits are a concern

@ADmad
Copy link
Copy Markdown
Member

ADmad commented Mar 8, 2026

Specifically: Scenarios where max_allowed_packet limits are a concern

Not sure how subquery strategy would be worse in comparison to the cte strategy in that regard.

max_allowed_packet limit is usually a concern for write operations not read operations.

max_allowed_packet limits the size of individual packets sent between the MySQL client and server. While it primarily affects WRITE operations like INSERT, UPDATE, or LOAD DATA, it can also impact SELECT queries when the result set is extremely large.

However, SELECT queries are not blocked by max_allowed_packet in the same way as INSERT. Instead, MySQL splits large result sets into multiple packets and sends them sequentially. This allows the query to succeed even if the total result size exceeds the packet limit, as long as no single packet exceeds the limit.

@dereuromark
Copy link
Copy Markdown
Member Author

dereuromark commented Mar 14, 2026

Fair.
Possible remaining CTE use cases maybe:

  • Better query plan visibility/debugging (CTEs are named, explicit) - More explicit/debuggable query structure
  • Specific database optimizers that handle CTEs better than correlated subqueries
  • Reusability when the same set is needed multiple times

Either way we should switch the default then to subquery in 5.next as follow-up.

@justindenick
Copy link
Copy Markdown

justindenick commented Mar 16, 2026

I took a look at this and there are a couple issues, first 1 is pretty trivial.

  1. I think the proper DB limit to consider is max_parameter count
    for postgres, this is Int16. No function can accept more that 65535 parameters.

  2. the cte strategy is actually far worse than IN.
    under the hood it's still has to produce the _joinData to match the associated data. I think this is how the tests are actually able to pass. Without the article_id being provided by the article_tags, there isn't a way to match any tags to an article.

in the test testCTEStrategyMultipleParents you can see this by dumping the results of the select loader.
ln(204) \Cake\Core\pr($fetchQuery->all()->toArray());
for CTE, every Tag is associated with every record from article_tags

WITH _cte_Articles AS (
SELECT (Articles.id) FROM articles Articles WHERE Articles.id <> :c0 GROUP BY Articles.id
) 
SELECT Tags.id AS Tags__id, Tags.name AS Tags__name, Tags.description AS Tags__description, Tags.created AS Tags__created FROM tags Tags INNER JOIN articles_tags ArticlesTags ON 1 = 1

_cte_Articles is never used

the cte query here, has no affect since it's not being joined or used for filtering.
There are 3 tags in the fixture and 4 article_tags. Each Tag is joined ON 1=1 to each of the article_tags where it should be something like RIGHT JOIN _cte_Articles ON (ArticleTags.article_id = _cte_Articles.id). As it is, 12 records are returned here. You can easily see this after adding a 4th Tag without and associated article_tag.

the cte itself, could be omitted in favor of EXISTS(...)

Where Tags are pulled by SELECT ArticlTags.,Tags. FROM tags Tags RIGHT JOIN article_tags ArticleTags ON(ArticleTags.tag_id = Tags.id) WHERE EXISTS(SELECT FROM articles JOIN article_tags ON(tag_id = Tags.id AND (the conditions used to find the Articles go here)

This would allow reuse the initial WHERE conditions instead of parsing the related ids and using them in IN.
The cost to that could be high if the initial conditions are complicated.

With BelongsToMany, we already have a good map of what belongs to what, using through. I haven't yet looked at a plain HasMany. (eg. comments might belong to articles with comments.articles_id as the foreign key)

Here we can still use EXISTS(...) SELECT * FROM comments Comments WHERE EXISTS(SELECT FROM articles WHERE id = Comments.article_id AND (the conditions used to find the Articles go here)

To be fair, I haven't yet looked at the subquery strategy to determine if my EXISTS theory wouldn't be better positioned there.

@dereuromark dereuromark marked this pull request as draft March 16, 2026 20:26
@dereuromark
Copy link
Copy Markdown
Member Author

dereuromark commented Mar 16, 2026

@justindenick Thanks for the detailed analysis, you're right.

Looking at the SQL output you captured, the CTE is defined but never actually joined to filter the results - the _cte_Articles appears in the WITH clause but isn't referenced in any JOIN. Combined with the ON 1 = 1 junction table join, this creates a cartesian product instead of proper filtering.

The tests passing is indeed coincidental - the ORM's result marshalling uses _joinData to match records back to parents after the query returns, which happens to produce correct-looking results with the current fixture data. Adding a tag without an associated article_tag would expose this.

Good catch on the max_allowed_packet point too - that limit primarily affects writes and large result sets, not query string length for IN clauses.

Given that:

  1. The subquery strategy already handles the large dataset use case well (and benchmarks slightly faster)
  2. The CTE implementation has fundamental issues for BelongsToMany
  3. ADmad's earlier suggestion to change the default to subquery in 5.next addresses the original issue

I'll probably close this PR. A separate PR for changing the default strategy to subquery makes more sense.

@celsowm What do you think?

@celsowm
Copy link
Copy Markdown
Contributor

celsowm commented Mar 16, 2026

I agree

@dereuromark dereuromark deleted the feature/cte-eager-loading-strategy branch March 16, 2026 23:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants