Skip to content

feat: optimize count(*) for pk/append table#317

Open
lucasfang wants to merge 12 commits into
alibaba:mainfrom
lucasfang:count_optimize
Open

feat: optimize count(*) for pk/append table#317
lucasfang wants to merge 12 commits into
alibaba:mainfrom
lucasfang:count_optimize

Conversation

@lucasfang
Copy link
Copy Markdown
Collaborator

@lucasfang lucasfang commented May 29, 2026

Purpose

Linked issue: close #321

This PR adds an optimized count(*) path for both PK and Append tables by introducing a dedicated CountReader API and table-specific implementations.

Main changes:

  • Add CountReader API and TableRead::CreateCountReader(splits).
  • Implement PK count path via PKCountReader (metadata path + MOR merge path).
  • Implement Append count path via AppendCountReader (metadata-first).
  • Keep current unsupported boundaries explicit:
    • predicate pushdown is not supported in CountReader path
    • force_keep_delete is not supported in PK CountReader path

Tests

UT:

  • Add src/paimon/core/table/source/pk_count_reader_test.cpp
    • MOR snapshot count
    • DV snapshot count
    • empty splits
    • invalid split type
  • Add src/paimon/core/table/source/append_count_reader_test.cpp
    • append snapshot count
    • empty splits
    • invalid split type
    • predicate-not-supported behavior

IT:

  • Update test/inte/scan_and_read_inte_test.cpp
    • merge count assertions into corresponding original PK scan/read tests
    • keep consistency check between CreateReader row count and CreateCountReader result
    • keep unsupported-option tests

API and Format

API impact:

  • Add include/paimon/reader/count_reader.h
  • Add TableRead::CreateCountReader(splits) in include/paimon/table/source/table_read.h

Storage format / protocol impact:

  • No storage format change
  • No protocol compatibility change

Documentation

No additional user-facing documentation is required for this PR.

Generative AI tooling

Generated-by: GitHub Copilot (GPT-5.3-Codex)

Copilot AI review requested due to automatic review settings May 29, 2026 02:49
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an optimized CountReader API for PK tables. The default TableRead::CreateCountReader returns NotImplemented, while KeyValueTableRead provides a real implementation built on a new CountSplitRead that picks between a metadata-only count for raw-convertible/DV splits and a merge + drop-delete + count path (via a new RowCountAccumulator) for MOR splits, using a column-pruned read schema. Predicate pushdown and force_keep_delete are rejected for now. Integration tests verify counts against existing full-read results, plus empty-splits and unsupported-option cases.

Changes:

  • New CountReader abstraction (include/paimon/reader/count_reader.h) and TableRead::CreateCountReader with PK implementation via CountSplitRead + RowCountAccumulator.
  • Metadata-based counting reuses DataSplitImpl::PartialMergedRowCount; MOR path reuses MergeFileSplitRead::CreateSortMergeReaderForSection(..., drop_delete=true) with a trimmed PK-only schema (added InternalReadContext::CreateWithSchema and MergeFileSplitRead::GetKeyComparator).
  • Adds count integration tests and an unrelated NormalizePath scheme test.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
include/paimon/reader/count_reader.h New CountReader interface with CountRows().
include/paimon/table/source/table_read.h Adds virtual CreateCountReader(splits) on TableRead.
src/paimon/core/table/source/table_read.cpp Default CreateCountReader returns NotImplemented.
src/paimon/core/table/source/key_value_table_read.{h,cpp} PK CreateCountReader implementation; rejects predicate and force_keep_delete; stores path factory/context/executor.
src/paimon/core/operation/count_split_read.{h,cpp} New CountSplitRead with metadata-count + merge-count strategies and PK-only read schema.
src/paimon/core/mergetree/row_count_accumulator.{h,cpp} New helper that iterates SortMergeReader output and counts rows.
src/paimon/core/operation/internal_read_context.{h,cpp} Adds CreateWithSchema factory to clone context with a new read schema.
src/paimon/core/operation/merge_file_split_read.h Exposes GetKeyComparator() for use by count path.
src/paimon/CMakeLists.txt Registers the two new source files.
test/inte/scan_and_read_inte_test.cpp Adds count IT cases (MOR/DV/empty/consistency/predicate/force-keep-delete).
src/paimon/common/utils/path_util_test.cpp Unrelated test for NormalizePath with URI scheme.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/paimon/common/utils/path_util_test.cpp Outdated
Comment thread include/paimon/reader/count_reader.h
Comment thread src/paimon/CMakeLists.txt Outdated
Comment thread src/paimon/core/operation/count_split_read.cpp Outdated
Comment thread src/paimon/core/operation/count_split_read.cpp Outdated
@lucasfang lucasfang force-pushed the count_optimize branch 3 times, most recently from 469ae2d to 9ffcc24 Compare May 29, 2026 06:47
@lxy-9602
Copy link
Copy Markdown
Collaborator

Please add a issue for this pr.

Comment thread include/paimon/table/source/table_read.h
Comment thread src/paimon/common/utils/path_util_test.cpp
Comment thread src/paimon/core/operation/count_split_read.cpp Outdated
Comment thread src/paimon/core/operation/count_split_read.cpp Outdated
Comment thread src/paimon/core/operation/count_split_read.cpp Outdated
Comment thread src/paimon/core/table/source/key_value_table_read.cpp
Comment thread test/inte/scan_and_read_inte_test.cpp Outdated
@lucasfang lucasfang changed the title feat: optimize count(*) for pk table feat: optimize count(*) for pk/append table Jun 1, 2026
}

return split->RowCount();
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFileMeta may not contain deletion information (e.g., data_deletion_files_.cardinality == nullopt), and falling back to split->RowCount() would ignore deletions, which is not correct.

Comment thread src/paimon/core/table/source/pk_count_reader.cpp Outdated
Comment thread src/paimon/core/table/source/pk_count_reader.cpp Outdated
Comment thread test/inte/scan_and_read_inte_test.cpp
ASSERT_OK_AND_ASSIGN(int64_t count, count_reader->CountRows());
ASSERT_EQ(count, 0);
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’d like to understand whether these tests actually cover the case where the delete info in older versions is nullopt, so the code falls back to the merge read path. Also, could you please add an end-to-end test for append tables as well?

/// This replaces the entire KeyValueProjectionReader + FieldMappingReader + PredicateBatchReader
/// chain in COUNT(*) queries. Instead of materializing Arrow batches, it directly iterates
/// merged KeyValue objects and counts them.
///
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m not quite following this comment. Why would entire replace FieldMappingReader? If merging is needed, aren’t those still necessary? Please double check all the comments @dalingmeng

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also PredicateBatchReader is only optional, and in most cases, there is no PredicateBatchReader.

Comment thread src/paimon/core/table/source/pk_count_reader.cpp
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] optimize count(*) for PK table

4 participants