
HYPERFLEET-556 - doc - Sentinel resource profiling result and recommendation#86

Open
pnguyen44 wants to merge 6 commits into openshift-hyperfleet:main from pnguyen44:HYPERFLEET-556

Conversation

@pnguyen44
Contributor

@pnguyen44 pnguyen44 commented Mar 26, 2026

Summary

  • Document Sentinel resource profiling results at 100, 1000, and 5000 clusters per poll cycle
  • Current defaults (100m/500m CPU, 128Mi/512Mi memory) validated as safe up to ~1000 clusters per instance; no changes to values.yaml
  • Document recommendations for operators managing beyond 1000 clusters per instance
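For orientation, the defaults referenced above would correspond to a Helm values fragment along these lines. This is a sketch only: the exact key layout in this chart's values.yaml is an assumption, not confirmed against the repository.

```yaml
# Hypothetical excerpt from the Sentinel chart's values.yaml;
# key names are illustrative, not confirmed against the actual chart.
sentinel:
  resources:
    requests:
      cpu: 100m        # validated safe up to ~1000 clusters per instance
      memory: 128Mi    # exceeded at 5000 clusters in profiling
    limits:
      cpu: 500m        # CPU stayed below this limit at all tested scales
      memory: 512Mi
```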

Jira ticket

Test Plan

  • Unit tests added/updated
  • make test-all passes
  • make lint passes
  • Helm chart changes validated with make test-helm (if applicable)
  • Deployed to a development cluster and verified (if Helm/config changes)
  • E2E tests passed (if cross-component or major changes)

Summary by CodeRabbit

  • Documentation
    • Added a "Sentinel Profiling Results" page detailing the profiling experiment and findings across cluster scales.
    • Summarizes CPU and memory observations (CPU generally within limits; memory grows with cluster count and can exceed requests at extreme scale) and notes saturation behavior at highest scale.
    • Provides sizing recommendations (~1000 clusters per instance), scaling/sharding guidance, optional VPA note, and links to related docs.

@openshift-ci

openshift-ci bot commented Mar 26, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ciaranroche for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai bot commented Mar 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

A new docs/resource-profiling.md page titled "Sentinel Profiling Results" was added. It describes the profiling goal (validate configured CPU/memory requests & limits), experiment setup (5s poll, 15-minute scenarios, 60s kubectl top sampling, tracing disabled, in-cluster mock-hyperfleet-api), and cluster sizes tested (100, 1000, 5000). The page records aggregated CPU and memory ranges and averages per cluster count, notes CPU stays below the 500m limit while the 128Mi memory request is exceeded at 5000 clusters, reports cycle-time saturation at 5000 clusters, and gives sizing/sharding guidance (~1000 clusters per instance, use resourceSelector) plus related links.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 3 passed

| Check name | Status | Explanation |
| ---------- | ------ | ----------- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: adding documentation of Sentinel resource profiling results and recommendations, which directly matches the addition of the resource-profiling.md document. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/resource-profiling.md`:
- Around line 12-13: Update the profiling guidance that currently lists
"Duration" and "Sample interval" so sampling is denser and runs longer: change
the Sample interval from 60s to a finer resolution (e.g., 5–10s) and increase
Duration from 5 min to a longer window (e.g., 15–30 min or multiple runs) to
capture transients; add a requirement to record percentile metrics (p95, p99) in
addition to average/peak and document how to aggregate multiple runs; apply
these changes to the sections that mention "Duration" and "Sample interval" and
to the sizing guidance that references averages/peaks so recommendations use
percentiles and longer, higher-resolution sampling.
- Around line 33-34: Update the documentation to add an explicit mitigation for
poll-interval overrun noted in the "At 5000 clusters with a 5s poll interval"
sentence: add a recommendation entry in the existing recommendation table (and
mirror it where lines 39-47 discuss scaling) that suggests concrete operational
controls such as increasing the poll interval, sharding workloads across
multiple Sentinel instances, and introducing a horizontal-scaling trigger (e.g.,
queue length or CPU threshold) to prevent continuous back-to-back runs; ensure
the new table row references the "poll interval" condition and gives brief
actionable steps and thresholds for large-scale deployments.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 526c47a0-716a-43d0-8346-dec85b183ce5

📥 Commits

Reviewing files that changed from the base of the PR and between 5dcd07f and c1a97e1.

📒 Files selected for processing (1)
  • docs/resource-profiling.md

@pnguyen44 pnguyen44 marked this pull request as draft March 26, 2026 18:56
@pnguyen44 pnguyen44 marked this pull request as ready for review March 26, 2026 20:30
@openshift-ci openshift-ci bot requested a review from rh-amarin March 26, 2026 20:30
@pnguyen44 pnguyen44 marked this pull request as draft March 26, 2026 20:50
@pnguyen44 pnguyen44 marked this pull request as ready for review March 27, 2026 16:18
@openshift-ci openshift-ci bot requested a review from aredenba-rh March 27, 2026 16:18

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
docs/resource-profiling.md (1)

41-45: Recommendations need concrete per-tier sizing values.

This section stays generic (“increase resource limits”) even though the doc presents scale-specific profiling. Add explicit request/limit recommendations per tier so operators can apply sizing directly and consistently.

Suggested doc enhancement
 ## Recommendations
 
-Current defaults (CPU 100m/500m, memory 128Mi/512Mi) support up to ~1000 clusters per Sentinel instance. Beyond 1000 clusters, operators should increase resource limits or shard across multiple instances using `resource_selector`. The results table above can be used to guide sizing.
+Current defaults (CPU 100m/500m, memory 128Mi/512Mi) support up to ~1000 clusters per Sentinel instance. Beyond 1000 clusters, operators should increase resource limits or shard across multiple instances using `resource_selector`.
+
+Suggested starting tiers from this profiling run:
+
+| Tier | Cluster count | CPU request / limit | Memory request / limit |
+| ---- | ------------- | ------------------- | ---------------------- |
+| Small | up to 100 | 100m / 500m | 128Mi / 512Mi |
+| Medium | up to 1000 | 100m / 500m | 128Mi / 512Mi |
+| Large | up to 5000 | 150m / 500m | 192Mi / 512Mi |
+
+Re-profile after changing poll interval, selectors, or workload shape.
 
 A Vertical Pod Autoscaler (VPA) could automatically adjust requests to match actual usage, removing the need to manually tune. Most useful if cluster count drifts over time.

As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
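The optional VPA mentioned in the suggested text could be sketched as a minimal manifest like the following. The target Deployment name `sentinel` and the manifest name are assumptions for illustration, not taken from the chart.

```yaml
# Sketch of a VerticalPodAutoscaler for Sentinel; the target Deployment
# name "sentinel" is an assumption, not confirmed against the chart.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: sentinel-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sentinel
  updatePolicy:
    updateMode: "Auto"   # VPA applies recommendations by evicting pods
```

Note that VPA in "Auto" mode restarts pods to apply new requests, which is worth calling out in the doc if adopted.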

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/resource-profiling.md` around lines 41 - 45, Update the
"Recommendations" section to replace the generic guidance with concrete per-tier
CPU and memory request/limit values tied to cluster-count tiers; specifically,
add a small/medium/large/xlarge table or list that maps cluster ranges (e.g.,
0–250, 251–500, 501–1000, >1000) to explicit requests/limits (starting from the
current defaults CPU 100m/500m, memory 128Mi/512Mi and increasing appropriately
per tier), include actionable guidance for sharding with resourceSelector and
when to enable VPA, and ensure the wording under the "Recommendations" heading
clearly states the numeric values so operators can apply them directly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/resource-profiling.md`:
- Line 43: The docs reference the wrong config key name for sharding
guidance—replace all occurrences of the camelCase key "resourceSelector" with
the actual YAML key "resource_selector" (including the sentence that currently
reads "shard across multiple instances using `resourceSelector`" and any
examples/tables that show the key) so operators follow the correct config;
verify consistency across the document and update any related example snippets,
headings, or cross-references that mention the camelCase variant.
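Sharding with `resource_selector` might look like the following sketch, with two Sentinel instances splitting the fleet by label. The key placement and selector syntax are assumptions; check the actual Sentinel config schema before using.

```yaml
# Hypothetical per-instance config: two Sentinel instances splitting the
# fleet by a label selector. Key names and syntax are illustrative only.
# Instance A
sentinel:
  resource_selector: "shard in (a)"
---
# Instance B
sentinel:
  resource_selector: "shard in (b)"
```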

---

Nitpick comments:
In `@docs/resource-profiling.md`:
- Around line 41-45: Update the "Recommendations" section to replace the generic
guidance with concrete per-tier CPU and memory request/limit values tied to
cluster-count tiers; specifically, add a small/medium/large/xlarge table or list
that maps cluster ranges (e.g., 0–250, 251–500, 501–1000, >1000) to explicit
requests/limits (starting from the current defaults CPU 100m/500m, memory
128Mi/512Mi and increasing appropriately per tier), include actionable guidance
for sharding with resourceSelector and when to enable VPA, and ensure the
wording under the "Recommendations" heading clearly states the numeric values so
operators can apply them directly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 601531c3-478f-4c26-ab6d-ed158b221a22

📥 Commits

Reviewing files that changed from the base of the PR and between 193058a and 17bf6e4.

📒 Files selected for processing (1)
  • docs/resource-profiling.md


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
docs/resource-profiling.md (1)

39-43: ⚠️ Potential issue | 🟡 Minor

Add poll interval adjustment as an explicit mitigation option.

Line 39 identifies that Sentinel is saturated at 5000 clusters with the 5s poll interval, but the recommendation in line 43 doesn't mention increasing the poll interval as a mitigation option. Including this would give operators a simpler alternative to sharding or resource scaling.

📝 Suggested addition
-Current defaults (CPU 100m/500m, memory 128Mi/512Mi) support up to ~1000 clusters per Sentinel instance. Beyond 1000 clusters, operators should increase resource requests and limits or shard across multiple instances using `resourceSelector`. The results table above can be used to guide sizing.
+Current defaults (CPU 100m/500m, memory 128Mi/512Mi) support up to ~1000 clusters per Sentinel instance. Beyond 1000 clusters, operators should increase resource requests and limits, increase the poll interval (e.g., 10s) to allow idle time between cycles, or shard across multiple instances using `resourceSelector`. The results table above can be used to guide sizing.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/resource-profiling.md` around lines 39 - 43, The recommendations section
for Sentinel capacity omits increasing the poll interval as a mitigation; update
the text in the "Recommendations" paragraph that currently references defaults
(CPU 100m/500m, memory 128Mi/512Mi) and `resourceSelector` to explicitly include
increasing the poll interval (e.g., from 5s to a higher value) as an alternative
to sharding or raising resource requests/limits, and mention the tradeoff that
longer intervals reduce CPU load but increase detection latency so operators can
choose the correct balance.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@docs/resource-profiling.md`:
- Around line 39-43: The recommendations section for Sentinel capacity omits
increasing the poll interval as a mitigation; update the text in the
"Recommendations" paragraph that currently references defaults (CPU 100m/500m,
memory 128Mi/512Mi) and `resourceSelector` to explicitly include increasing the
poll interval (e.g., from 5s to a higher value) as an alternative to sharding or
raising resource requests/limits, and mention the tradeoff that longer intervals
reduce CPU load but increase detection latency so operators can choose the
correct balance.
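As a concrete illustration of the poll-interval mitigation discussed above, the change could be as small as the following fragment. The key name `poll_interval` and its placement are assumptions for illustration.

```yaml
# Hypothetical Sentinel config change: raising the poll interval from the
# profiled 5s to 10s so cycles have idle time at high cluster counts.
# Tradeoff: longer intervals reduce CPU load but increase detection latency.
sentinel:
  poll_interval: 10s
```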

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7ea245ab-61ef-4183-bdb9-b3e13ce41916

📥 Commits

Reviewing files that changed from the base of the PR and between 17bf6e4 and aacce8c.

📒 Files selected for processing (1)
  • docs/resource-profiling.md

at 1000 clusters, add multi-instance deployment
link
@pnguyen44 pnguyen44 requested a review from rafabene March 27, 2026 20:55

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
docs/resource-profiling.md (1)

41-45: Add explicit poll-interval option for cycle-time saturation scenario.

Line 39 identifies that at 5000 clusters, "cycle time exceeds the interval, so Sentinel runs back-to-back with no idle time." However, the recommendations on line 43 mention "increase resource requests and limits or shard" but omit the third mitigation option: adjusting the poll interval.

For the 5000-cluster saturation case, increasing resource limits won't resolve the cycle-time bottleneck (CPU is already well under the 500m limit at peak 172m). The viable options are:

  1. Increase poll interval (e.g., from 5s to 10s) to allow cycles to complete with idle time
  2. Shard workload across multiple instances using resourceSelector

This guidance was specifically requested in previous reviews. Consider adding the poll-interval option explicitly.

📝 Suggested documentation enhancement
-Current defaults (CPU 100m/500m, memory 128Mi/512Mi) support up to ~1000 clusters per Sentinel instance. At this scale the CPU request (100m) is tight with peaks exceeding it, but the 500m limit provides sufficient burst headroom. Beyond 1000 clusters, operators should increase resource requests and limits or shard across multiple instances using `resourceSelector`. The results table above can be used to guide sizing.
+Current defaults (CPU 100m/500m, memory 128Mi/512Mi) support up to ~1000 clusters per Sentinel instance. At this scale the CPU request (100m) is tight with peaks exceeding it, but the 500m limit provides sufficient burst headroom. Beyond 1000 clusters, operators should increase resource requests and limits or shard across multiple instances using `resourceSelector`. At 5000 clusters, cycle-time saturation occurs; increasing the poll interval or sharding are the primary mitigations (resource increases alone won't resolve the cycle-time bottleneck). The results table above can be used to guide sizing.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/resource-profiling.md` around lines 41 - 45, The Recommendations section
omits advising to increase the poll interval as a mitigation for the
5000-cluster "cycle time exceeds the interval" saturation case; update the
"Recommendations" paragraph (the text discussing CPU/memory defaults, 1000 vs
5000 clusters, and sharding with resourceSelector) to explicitly add a third
mitigation: increase the poll interval (e.g., from 5s to 10s) so cycles have
idle time, and mention this is the appropriate fix when CPU usage is below
limits but cycle time itself is the bottleneck; reference the 5000-cluster
saturation example and include the concrete suggestion to adjust poll interval
as an alternative to changing resource requests or sharding with
resourceSelector.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@docs/resource-profiling.md`:
- Around line 41-45: The Recommendations section omits advising to increase the
poll interval as a mitigation for the 5000-cluster "cycle time exceeds the
interval" saturation case; update the "Recommendations" paragraph (the text
discussing CPU/memory defaults, 1000 vs 5000 clusters, and sharding with
resourceSelector) to explicitly add a third mitigation: increase the poll
interval (e.g., from 5s to 10s) so cycles have idle time, and mention this is
the appropriate fix when CPU usage is below limits but cycle time itself is the
bottleneck; reference the 5000-cluster saturation example and include the
concrete suggestion to adjust poll interval as an alternative to changing
resource requests or sharding with resourceSelector.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a43d00b1-3874-4465-898f-d8e2d4bbc219

📥 Commits

Reviewing files that changed from the base of the PR and between aacce8c and 30ee34a.

📒 Files selected for processing (1)
  • docs/resource-profiling.md

@rafabene
Contributor

/lgtm

