DRILL-8545: Disable HashAgg for collect_to_list_varchar due to ordering requirements #3042
Open
rymarm wants to merge 1 commit into apache:master from
DRILL-8545: COLLECT_TO_LIST_VARCHAR function returns incorrect result when Hash Aggregator operator used
Description
Root cause
The collect_to_list_varchar function is incompatible with the Hash Aggregator because the aggregator processes data in a non-sequential manner, while the underlying ValueVector framework requires sequential writes for variable-length data. Furthermore, the Drill UDF framework lacks a straightforward mechanism to buffer these values internally before flushing them to the output vector, making it impossible to reorder them on the fly during the aggregation phase.
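To make the root cause concrete, here is a minimal sketch of why interleaved group writes fail when the output only accepts sequential writes. It is a toy model, not Drill's ValueVector or UDF API: `ToyListVector`, `aggregate`, and the `String[]{key, value}` row layout are all hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SequentialWriteDemo {

    /** Toy list vector (hypothetical): values of group i occupy child[offsets[i] .. offsets[i+1]). */
    static final class ToyListVector {
        private final List<String> child = new ArrayList<>();
        private final List<Integer> offsets = new ArrayList<>(List.of(0));

        int groupCount() {
            return offsets.size() - 1;
        }

        /** Appends a value; only the last group, or a brand-new group right after it, is writable. */
        void append(int group, String value) {
            int last = groupCount() - 1;
            if (group == last + 1) {
                offsets.add(child.size()); // open a new group at the end of the child data
            } else if (group != last) {
                throw new IndexOutOfBoundsException(
                    "group " + group + " is closed; writes must be sequential");
            }
            child.add(value);
            offsets.set(group + 1, child.size()); // extend the end offset of the open group
        }

        List<String> get(int group) {
            return new ArrayList<>(child.subList(offsets.get(group), offsets.get(group + 1)));
        }
    }

    /** Assigns group indexes in first-seen order and writes each row as soon as it arrives. */
    static ToyListVector aggregate(List<String[]> rows) {
        Map<String, Integer> groupIndex = new LinkedHashMap<>();
        ToyListVector out = new ToyListVector();
        for (String[] row : rows) {
            Integer idx = groupIndex.get(row[0]);
            if (idx == null) {
                idx = groupIndex.size();
                groupIndex.put(row[0], idx);
            }
            out.append(idx, row[1]);
        }
        return out;
    }
}
```

With rows sorted by key, such as (A, x), (A, z), (B, y), every write targets the open group and the call succeeds. With unsorted rows, such as (A, x), (B, y), (A, z), the third row goes back to a group that is already closed, and the toy vector throws `IndexOutOfBoundsException`.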
Solution
To ensure data integrity and prevent index out-of-bounds exceptions, I have modified the Hash Aggregator physical planning rule. The planner will now explicitly disallow the Hash Aggregator if a collect_to_list_varchar call is detected in the aggregate expression. This forces the optimizer to fall back to the Streaming Aggregator, which provides the necessary ordered input.
Documentation
No changes.
Testing
Updated the existing unit tests so that they cover this problem.
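For reference, the planner-side check described under Solution might look roughly like the sketch below. It is self-contained and deliberately avoids Drill and Calcite types: `HashAggEligibility`, `canUseHashAgg`, `chooseAggregator`, and the use of plain function-name strings are assumptions made for illustration, not the code in this commit.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class HashAggEligibility {

    // Functions assumed to need ordered, sequential input.
    // Only the function named in this PR is listed.
    private static final Set<String> ORDER_SENSITIVE = Set.of("collect_to_list_varchar");

    /** Returns false if any aggregate call in the expression rules out the Hash Aggregator. */
    static boolean canUseHashAgg(List<String> aggFunctionNames) {
        for (String name : aggFunctionNames) {
            if (ORDER_SENSITIVE.contains(name.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }

    /** Picks the aggregator label (hypothetical strings), falling back to streaming when needed. */
    static String chooseAggregator(List<String> aggFunctionNames) {
        return canUseHashAgg(aggFunctionNames) ? "HashAgg" : "StreamAgg";
    }
}
```

A test along these lines would check that an aggregate containing collect_to_list_varchar is planned with the Streaming Aggregator, while one with only functions such as sum or count can still use the Hash Aggregator.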