Skip to content

HYPERFLEET-762: Add OpenTelemetry configuration to standard environme…#89

Open
ldornele wants to merge 9 commits intoopenshift-hyperfleet:mainfrom
ldornele:HYPERFLEET-762
Open

HYPERFLEET-762: Add OpenTelemetry configuration to standard environme…#89
ldornele wants to merge 9 commits intoopenshift-hyperfleet:mainfrom
ldornele:HYPERFLEET-762

Conversation

@ldornele
Copy link
Copy Markdown
Contributor

@ldornele ldornele commented Mar 21, 2026

Motivation

  • Standards Compliance: Use official OpenTelemetry environment variables instead of custom ones
  • Cross-Component Consistency: Implement TRACING_ENABLED (Tracing standard) for unified control across hyperfleet-api, hyperfleet-sentinel, and adapters
  • Ecosystem Integration: Enable seamless integration with OpenTelemetry Operator, collectors, and other OTEL-compatible tools
  • Simplified Configuration: Reduce confusion by using industry-standard variable names

Changes

  • Add TRACING_ENABLED as primary toggle (Tracing standard override)
  • Replace HYPERFLEET_LOGGING_OTEL_SAMPLING_RATE with OTEL_TRACES_SAMPLER_ARG
  • Implement standard OTEL_* environment variables:
    • OTEL_SERVICE_NAME
    • OTEL_EXPORTER_OTLP_ENDPOINT
    • OTEL_EXPORTER_OTLP_PROTOCOL
    • OTEL_TRACES_SAMPLER
    • OTEL_TRACES_SAMPLER_ARG
    • OTEL_RESOURCE_ATTRIBUTES
  • Update servecmd to respect TRACING_ENABLED > config precedence
  • Update logging.md with OpenTelemetry environment variables table
  • Update config.md with OpenTelemetry configuration section

Variable Migration

Old (Removed) New (Standard)
HYPERFLEET_LOGGING_OTEL_SAMPLING_RATE OTEL_TRACES_SAMPLER_ARG
N/A OTEL_SERVICE_NAME
N/A OTEL_EXPORTER_OTLP_ENDPOINT
N/A OTEL_EXPORTER_OTLP_PROTOCOL
N/A OTEL_TRACES_SAMPLER
N/A OTEL_RESOURCE_ATTRIBUTES

Backward Compatibility:

  • HYPERFLEET_LOGGING_OTEL_ENABLED still works via Viper
  • Existing config files with logging.otel.enabled: true continue to work
  • TRACING_ENABLED takes precedence when set

Variable Precedence

TRACING_ENABLED (Tracing standard)

HYPERFLEET_LOGGING_OTEL_ENABLED (Viper env var)

config.yaml: logging.otel.enabled

Default (false)

Documentation

Summary by CodeRabbit

  • New Features

    • Support for TRACING_ENABLED plus standard OTEL_* environment variables to control OpenTelemetry tracing; tracing default enabled in deployments.
  • Changes

    • Sampling is now driven by OTEL_TRACES_SAMPLER/OTEL_TRACES_SAMPLER_ARG; removed default sampling-rate from config/CLI/Helm.
    • Service name derives from OTEL_SERVICE_NAME and service metadata is logged.
    • Deprecated HyperFleet-specific OTel env vars removed.
  • Documentation

    • Expanded OpenTelemetry guidance, precedence rules, and deployment checklist.
  • Tests

    • Added telemetry tests for exporters, sampler behavior, and shutdown.

@openshift-ci openshift-ci bot requested review from jsell-rh and pnguyen44 March 21, 2026 20:25
@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Mar 21, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign vkareh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Mar 21, 2026

Hi @ldornele. Thanks for your PR.

I'm waiting for a openshift-hyperfleet member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 21, 2026

Walkthrough

The PR removes the OpenTelemetry sampling_rate configuration surface (code, CLI flag, Helm value, and tests); changes default OTel enabled to true in Helm/config; introduces TRACING_ENABLED environment variable which, when set and valid, overrides configured tracing enablement (with deprecated HYPERFLEET_LOGGING_OTEL_ENABLED detection); updates startup to resolve tracingEnabled from env/config and pass it into API server routing; reworks telemetry initialization to use OTEL_* env vars for service name, exporter (OTLP HTTP/gRPC or stdout), and sampler selection; adds telemetry unit tests and logging field constants; and updates docs, Helm templates, flags, loader bindings, and go.mod deps.

Sequence Diagram(s)

sequenceDiagram
  participant Startup as Server Startup
  participant Env as Environment (TRACING_ENABLED, OTEL_*)
  participant Config as Application Config
  participant Telemetry as telemetry.InitTraceProvider
  participant Exporter as OTLP/Stdout Exporter

  Startup->>Env: Read TRACING_ENABLED, OTEL_SERVICE_NAME, OTEL_EXPORTER_OTLP_*, OTEL_TRACES_*
  Startup->>Config: Load config.logging.otel.enabled
  Startup->>Startup: Resolve tracingEnabled (TRACING_ENABLED overrides config if set)
  Startup->>Config: Optionally set resolved tracingEnabled into runtime config
  Startup->>Telemetry: InitTraceProvider(ctx, serviceName, serviceVersion)
  Telemetry->>Env: Read OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_PROTOCOL, OTEL_TRACES_SAMPLER, OTEL_TRACES_SAMPLER_ARG, OTEL_RESOURCE_ATTRIBUTES
  Telemetry->>Exporter: Create OTLP exporter (HTTP or gRPC) if endpoint set, else stdouttrace exporter
  Telemetry->>Telemetry: Parse and configure sampler from OTEL_TRACES_SAMPLER / ARG
  Telemetry->>Telemetry: Build resource (including FromEnv) and set TextMapPropagator (TraceContext + Baggage)
  Telemetry->>Startup: Return TracerProvider (or error)
Loading
sequenceDiagram
  participant Server as API Server
  participant Router as Routes Setup
  participant Middleware as OTel Middleware

  Server->>Router: NewAPIServer(tracingEnabled)
  Router->>Router: routes(tracingEnabled)
  alt tracingEnabled == true
    Router->>Middleware: Register OpenTelemetry middleware
  else tracingEnabled == false
    Router->>Router: Skip OTel middleware
  end
  Router->>Server: Return configured router
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 41.18% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly refers to the primary change in the PR: standardizing OpenTelemetry configuration to use standard environment variables instead of HyperFleet-specific ones.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
pkg/logger/fields.go (1)

45-45: Inconsistent field naming convention.

FieldTracingEnabled uses uppercase "TRACING_ENABLED" while all other field constants in this file use snake_case (e.g., FieldOTelEnabled = "otel_enabled", FieldSamplingRate = "sampling_rate"). This inconsistency could cause confusion in log queries.

♻️ Suggested fix for consistency
-	FieldTracingEnabled   = "TRACING_ENABLED"
+	FieldTracingEnabled   = "tracing_enabled"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/logger/fields.go` at line 45, FieldTracingEnabled uses uppercase
"TRACING_ENABLED" which is inconsistent with the snake_case convention used by
other constants (e.g., FieldOTelEnabled, FieldSamplingRate); rename or change
the constant value to snake_case—replace the value of FieldTracingEnabled with
"tracing_enabled" (or rename the identifier to match project convention) so it
matches the pattern used by other fields in pkg/logger/fields.go.
pkg/telemetry/otel_test.go (2)

140-145: Minor: Error message omits the actual error.

The shutdown error message doesn't include the error value, making debugging harder.

♻️ Suggested fix
 			defer func(ctx context.Context, tp *trace.TracerProvider) {
 				err := Shutdown(ctx, tp)
 				if err != nil {
-					t.Errorf("Failed to shutdown trace provider")
+					t.Errorf("Failed to shutdown trace provider: %v", err)
 				}
 			}(ctx, tp)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/telemetry/otel_test.go` around lines 140 - 145, The defer's error path in
the anonymous cleanup function calls t.Errorf("Failed to shutdown trace
provider") without including the returned error; update the t.Errorf call inside
the deferred function that wraps Shutdown(ctx, tp) to include the err value
(e.g., t.Errorf("Failed to shutdown trace provider: %v", err)) so the Shutdown
error from trace.TracerProvider is logged for debugging.

151-160: Incomplete sampling assertion for expectedSample=true cases.

When expectedSample is true, the test only verifies span.SpanContext().IsValid(). This doesn't confirm that the span was actually sampled—a span context can be valid but not sampled. Consider also checking IsSampled() for consistency with the expectedSample=false branch.

♻️ Suggested fix
 			if tt.expectedSample {
-				if !span.SpanContext().IsValid() {
-					t.Error("Expected valid span context for sampling=true")
+				if !span.SpanContext().IsValid() || !span.SpanContext().TraceFlags().IsSampled() {
+					t.Error("Expected valid and sampled span context for sampling=true")
 				}
 			} else {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/telemetry/otel_test.go` around lines 151 - 160, The test's sampling
assertion is incomplete: when expectedSample is true you only check
span.SpanContext().IsValid(), which doesn't ensure the span was actually
sampled; update the true-branch to also assert
span.SpanContext().TraceFlags().IsSampled() (mirroring the false-branch check),
so when expectedSample is true you fail the test if TraceFlags().IsSampled() is
false while keeping the existing IsValid() check for span.SpanContext().
docs/logging.md (1)

377-399: Code snippet differs from actual implementation.

The initialization snippet at lines 377-399 simplifies error handling compared to the actual implementation in cmd/hyperfleet-api/servecmd/cmd.go. Notably:

  • Line 380 ignores the strconv.ParseBool error, whereas the actual code logs a warning and falls back to config.
  • Line 395 uses defer tp.Shutdown(...) directly, whereas the actual code uses telemetry.Shutdown() with proper timeout context.

This is acceptable for a conceptual illustration, but consider adding a note that this is a simplified example.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/logging.md` around lines 377 - 399, The snippet in the docs simplifies
error handling compared to the real implementation: update the text to note it’s
a simplified example and call out the behavioral differences — specifically that
strconv.ParseBool errors are logged via logger.WithError and fall back to
environments.Environment().Config.Logging.OTel.Enabled instead of being ignored,
and that the real code calls telemetry.Shutdown(ctx, tp) with a timeout context
rather than deferring tp.Shutdown(...) directly after
telemetry.InitTraceProvider; mention the symbols strconv.ParseBool,
logger.WithError, telemetry.InitTraceProvider, tp.Shutdown and
telemetry.Shutdown so readers can map the doc snippet to the actual
implementation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/config.md`:
- Line 691: Update the checklist entry that currently lists
HYPERFLEET_CLUSTER_ADAPTERS and HYPERFLEET_NODEPOOL_ADAPTERS to use the
canonical environment variable names HYPERFLEET_ADAPTERS_REQUIRED_CLUSTER and
HYPERFLEET_ADAPTERS_REQUIRED_NODEPOOL so it matches the Complete Reference (the
entries referenced at lines 466-467); locate the checklist line in
docs/config.md and replace the two variable names accordingly to ensure
documentation consistency.

---

Nitpick comments:
In `@docs/logging.md`:
- Around line 377-399: The snippet in the docs simplifies error handling
compared to the real implementation: update the text to note it’s a simplified
example and call out the behavioral differences — specifically that
strconv.ParseBool errors are logged via logger.WithError and fall back to
environments.Environment().Config.Logging.OTel.Enabled instead of being ignored,
and that the real code calls telemetry.Shutdown(ctx, tp) with a timeout context
rather than deferring tp.Shutdown(...) directly after
telemetry.InitTraceProvider; mention the symbols strconv.ParseBool,
logger.WithError, telemetry.InitTraceProvider, tp.Shutdown and
telemetry.Shutdown so readers can map the doc snippet to the actual
implementation.

In `@pkg/logger/fields.go`:
- Line 45: FieldTracingEnabled uses uppercase "TRACING_ENABLED" which is
inconsistent with the snake_case convention used by other constants (e.g.,
FieldOTelEnabled, FieldSamplingRate); rename or change the constant value to
snake_case—replace the value of FieldTracingEnabled with "tracing_enabled" (or
rename the identifier to match project convention) so it matches the pattern
used by other fields in pkg/logger/fields.go.

In `@pkg/telemetry/otel_test.go`:
- Around line 140-145: The defer's error path in the anonymous cleanup function
calls t.Errorf("Failed to shutdown trace provider") without including the
returned error; update the t.Errorf call inside the deferred function that wraps
Shutdown(ctx, tp) to include the err value (e.g., t.Errorf("Failed to shutdown
trace provider: %v", err)) so the Shutdown error from trace.TracerProvider is
logged for debugging.
- Around line 151-160: The test's sampling assertion is incomplete: when
expectedSample is true you only check span.SpanContext().IsValid(), which
doesn't ensure the span was actually sampled; update the true-branch to also
assert span.SpanContext().TraceFlags().IsSampled() (mirroring the false-branch
check), so when expectedSample is true you fail the test if
TraceFlags().IsSampled() is false while keeping the existing IsValid() check for
span.SpanContext().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9550d68f-106e-4b28-81a8-9e6a21fdcecc

📥 Commits

Reviewing files that changed from the base of the PR and between 34ae499 and 6e087ee.

📒 Files selected for processing (12)
  • charts/values.yaml
  • cmd/hyperfleet-api/servecmd/cmd.go
  • docs/config.md
  • docs/logging.md
  • pkg/config/flags.go
  • pkg/config/loader.go
  • pkg/config/loader_test.go
  • pkg/config/logging.go
  • pkg/config/logging_test.go
  • pkg/logger/fields.go
  • pkg/telemetry/otel.go
  • pkg/telemetry/otel_test.go
💤 Files with no reviewable changes (3)
  • pkg/config/loader.go
  • pkg/config/logging_test.go
  • pkg/config/loader_test.go

@ldornele
Copy link
Copy Markdown
Contributor Author

/ok-to-test

@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Mar 21, 2026

@ldornele: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

Details

In response to this:

/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/telemetry/otel_test.go (1)

38-73: Consider adding t.Parallel() isolation or using t.Setenv.

The OTLP exporter test manipulates global environment variables. While tests in this file run sequentially by default, using t.Setenv (Go 1.17+) would automatically handle cleanup and prevent interference if parallelism is enabled later.

// Instead of:
os.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", "...")
defer os.Unsetenv("OTEL_EXPORTER_OTLP_ENDPOINT")

// Consider:
t.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://fake-otel-collector:4317")

This is a minor improvement for maintainability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/telemetry/otel_test.go` around lines 38 - 73, The test
TestInitTraceProvider_OTLPExporter mutates global env vars with
os.Setenv/os.Unsetenv; replace that pattern with
t.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://fake-otel-collector:4317") to
automatically handle cleanup and avoid cross-test interference, and optionally
add t.Parallel() at the top of TestInitTraceProvider_OTLPExporter if you want
the test to run concurrently; ensure references to InitTraceProvider and
Shutdown remain unchanged and that the tracer assertions still follow the
t.Setenv call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/telemetry/otel_test.go`:
- Around line 38-73: The test TestInitTraceProvider_OTLPExporter mutates global
env vars with os.Setenv/os.Unsetenv; replace that pattern with
t.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://fake-otel-collector:4317") to
automatically handle cleanup and avoid cross-test interference, and optionally
add t.Parallel() at the top of TestInitTraceProvider_OTLPExporter if you want
the test to run concurrently; ensure references to InitTraceProvider and
Shutdown remain unchanged and that the tracer assertions still follow the
t.Setenv call.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d7e5e533-cfbb-43f3-8928-71044a158f42

📥 Commits

Reviewing files that changed from the base of the PR and between 6e087ee and e95d947.

📒 Files selected for processing (5)
  • cmd/hyperfleet-api/servecmd/cmd.go
  • docs/config.md
  • docs/logging.md
  • pkg/logger/fields.go
  • pkg/telemetry/otel_test.go
✅ Files skipped from review due to trivial changes (2)
  • pkg/logger/fields.go
  • docs/logging.md

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/telemetry/otel_test.go`:
- Around line 52-62: The test currently calls Shutdown(ctx, tp) only after
assertions, which can leak the trace provider if an assertion fails; immediately
defer the cleanup right after creating the trace provider (tp) so Shutdown(ctx,
tp) is always executed—i.e., place a defer func() { _ = Shutdown(ctx, tp) }()
(or check/handle error inside defer) immediately after tp is
initialized/returned and before calling otel.Tracer("test") so the tracer, tp,
Shutdown, and ctx references are preserved and cleanup always runs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b3955823-9df9-4051-9ce6-ee7b96f04b8f

📥 Commits

Reviewing files that changed from the base of the PR and between e95d947 and b804696.

📒 Files selected for processing (1)
  • pkg/telemetry/otel_test.go

@tirthct
Copy link
Copy Markdown

tirthct commented Mar 23, 2026

/ok-to-test

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@go.mod`:
- Around line 35-41: Update the vulnerable dependency
go.opentelemetry.io/otel/sdk from v1.38.0 to at least v1.40.0 in go.mod to
mitigate CVE-2026-24051; specifically change the module version string for
go.opentelemetry.io/otel/sdk to v1.40.0 (or later) and run `go mod tidy`/`go
get` to refresh the lockfile and ensure the build picks up the patched release
that uses the absolute /usr/sbin/ioreg path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 105558a0-d208-456d-90df-dd5e89435b26

📥 Commits

Reviewing files that changed from the base of the PR and between 5dd7790 and 9d50749.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (3)
  • charts/templates/configmap.yaml
  • charts/values.yaml
  • go.mod
💤 Files with no reviewable changes (1)
  • charts/templates/configmap.yaml
✅ Files skipped from review due to trivial changes (1)
  • charts/values.yaml

// Determine if tracing is enabled
// Precedence: TRACING_ENABLED (tracing standard) > config (env/flags) > default
var tracingEnabled bool
if tracingEnv := os.Getenv("TRACING_ENABLED"); tracingEnv != "" {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we align on using a single variable for enabling tracing?

I see the standard says TRACING_ENABLED and this may come from before standarising environment variables with prefix HYPERFLEET_

Then, the docs mention HYPERFLEET_LOGGING_OTEL_ENABLED

I think we should consolidate in a single one and keep it consistent across repositories.

AFAIK OTEL_* variables do not have the HYPERFLEET_ prefix since they are used directly by the OTEL libraries, but any other variable that we introduce should be prefixed by our standard convention.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! I’ll remove HYPERFLEET_LOGGING_OTEL_ENABLED then. I’ll keep the OTEL_ variables* for consistency with hyperfleet-sentinel. Does that make sense?

Copy link
Copy Markdown
Contributor

@rh-amarin rh-amarin Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also to clarify, I meant to use HYPERFLEET_TRACING_ENABLED instead of simple TRACING_ENABLED

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HYPERFLEET_TRACING_ENABLED is not compliant with the tracing standards — we should use TRACING_ENABLED instead. See: tracing standards

Copy link
Copy Markdown
Collaborator

@xueli181114 xueli181114 Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am afraid the standards need to be updated to env with HYPERFLEET_ prefix. It comes from configuration.md

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got your point. In that case, should we update the tracing standard first so it aligns with the configuration standard as well?

Copy link
Copy Markdown
Contributor

@rafabene rafabene Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened a follow-up ticket for this: https://redhat.atlassian.net/browse/HYPERFLEET-816

Comment on lines +98 to +103
serviceName := "hyperfleet-api"
if svcName := os.Getenv("OTEL_SERVICE_NAME"); svcName != "" {
serviceName = svcName
}

traceProvider, err := telemetry.InitTraceProvider(ctx, serviceName, api.Version)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both sentinel and adapter have a name field
I wonder if we should add it to the API to avoid having this constant "hyperfleet-api" here

Then, here we are reading a variable that is otel specific.... I wonder if that should be done inside the telemetry.InitTraceProvider function, so main can become completely unaware of this setting

ldornele and others added 2 commits March 23, 2026 18:31
Updates OpenTelemetry SDK and exporters from v1.38.0 to v1.40.0, addressing
potential security vulnerabilities and fixing CI test-integration failures
caused by missing dependency entries in go.sum.

Key updates:
- go.opentelemetry.io/otel: v1.38.0 → v1.40.0
- go.opentelemetry.io/otel/sdk: v1.38.0 → v1.40.0
- go.opentelemetry.io/otel/exporters/otlp/*: v1.38.0 → v1.40.0
- go.opentelemetry.io/otel/exporters/stdout/stdouttrace: v1.38.0 → v1.40.0
- google.golang.org/grpc: v1.75.0 → v1.78.0

Includes transitive dependency updates for protobuf, grpc-gateway, and
golang.org/x/{crypto,net,sys,text}.

All 551 unit tests pass.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@xueli181114 xueli181114 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Developer review: inline follow-ups on consistency and operator-facing correctness.

Copy link
Copy Markdown
Collaborator

@xueli181114 xueli181114 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good direction — moving to standard OTEL env vars is the right call. A few things I noticed beyond what rh-amarin and rafabene already flagged.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@go.mod`:
- Line 151: Update the transitive gRPC dependency to a patched release by
bumping google.golang.org/grpc from v1.78.0 to a secure version (e.g., the
latest patched 1.8x+ release) in go.mod and run go mod tidy to lock it; ensure
the change resolves the CVE affecting :path header validation so the
OpenTelemetry gRPC exporter (otlptracegrpc) usages in the trace
export/initialization code are using the upgraded grpc library at runtime.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: feaf5d3b-45b3-4ade-b100-735d5a8bd5ea

📥 Commits

Reviewing files that changed from the base of the PR and between 5dd7790 and 718aab4.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (15)
  • charts/templates/configmap.yaml
  • charts/values.yaml
  • cmd/hyperfleet-api/servecmd/cmd.go
  • cmd/hyperfleet-api/server/api_server.go
  • cmd/hyperfleet-api/server/routes.go
  • docs/config.md
  • docs/logging.md
  • go.mod
  • pkg/config/flags.go
  • pkg/config/loader.go
  • pkg/config/logging.go
  • pkg/config/logging_test.go
  • pkg/telemetry/otel.go
  • pkg/telemetry/otel_test.go
  • test/helper.go
💤 Files with no reviewable changes (3)
  • charts/templates/configmap.yaml
  • pkg/config/loader.go
  • pkg/config/flags.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/telemetry/otel_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • charts/values.yaml
  • pkg/config/logging_test.go
  • docs/logging.md
  • pkg/config/logging.go
  • docs/config.md
  • pkg/telemetry/otel.go

golang.org/x/text v0.33.0 // indirect
google.golang.org/genproto/googleapis/api v0.0.0-20260128011058-8636f8732409 // indirect
google.golang.org/genproto/googleapis/rpc v0.0.0-20260128011058-8636f8732409 // indirect
google.golang.org/grpc v1.78.0 // indirect
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify grpc vulnerability and check latest available version

echo "=== Current grpc version in go.mod ==="
rg -n 'google\.golang\.org/grpc' go.mod

echo ""
echo "=== Check for grpc usage in codebase ==="
rg -n 'google\.golang\.org/grpc' --type go -g '!go.mod' -g '!go.sum' | head -20

echo ""
echo "=== Check otlptracegrpc usage (creates gRPC connections) ==="
rg -n 'otlptracegrpc' --type go

echo ""
echo "=== Query OSV for grpc v1.78.0 advisories ==="
curl -s -X POST https://api.osv.dev/v1/query \
  -H "Content-Type: application/json" \
  -d '{"package": {"ecosystem": "Go", "name": "google.golang.org/grpc"}, "version": "1.78.0"}' | \
  python3 -c "import sys,json; d=json.load(sys.stdin); vulns=d.get('vulns',[]); print(f'Found {len(vulns)} vulnerability(ies)'); [print(f\"  - {v.get('id')}: {v.get('summary','N/A')}\") for v in vulns]"

Repository: openshift-hyperfleet/hyperfleet-api

Length of output: 776


Upgrade google.golang.org/grpc to address a critical authorization bypass vulnerability.

The current version v1.78.0 has a critical authorization bypass (CVE identifiers: GO-2026-4762, GHSA-p77j-4mvh-x3m3) involving missing leading slash validation in :path headers. This affects gRPC services with authorization interceptors.

The codebase uses gRPC through otlptracegrpc for OpenTelemetry trace export (see pkg/telemetry/otel.go lines 55, 59), making this vulnerability applicable. Upgrade to a patched version immediately.

🧰 Tools
🪛 OSV Scanner (2.3.5)

[CRITICAL] 151-151: google.golang.org/grpc 1.78.0: Authorization bypass in gRPC-Go via missing leading slash in :path in google.golang.org/grpc

(GO-2026-4762)


[CRITICAL] 151-151: google.golang.org/grpc 1.78.0: gRPC-Go has an authorization bypass via missing leading slash in :path

(GHSA-p77j-4mvh-x3m3)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@go.mod` at line 151, Update the transitive gRPC dependency to a patched
release by bumping google.golang.org/grpc from v1.78.0 to a secure version
(e.g., the latest patched 1.8x+ release) in go.mod and run go mod tidy to lock
it; ensure the change resolves the CVE affecting :path header validation so the
OpenTelemetry gRPC exporter (otlptracegrpc) usages in the trace
export/initialization code are using the upgraded grpc library at runtime.

logger.With(ctx,
"deprecated_variable", "HYPERFLEET_LOGGING_OTEL_ENABLED",
"replacement", "TRACING_ENABLED",
).Warn("HYPERFLEET_LOGGING_OTEL_ENABLED is deprecated and ignored. Please use TRACING_ENABLED instead.")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says HYPERFLEET_LOGGING_OTEL_ENABLED still works via Viper, but the Viper
bindings for logging.otel.enabled were removed from loader.go (both bindAllEnvVars and
bindFlags). This means the env var is truly dead — it won't flow into the config struct.

The deprecation warning correctly says "ignored", but existing deployments that used
HYPERFLEET_LOGGING_OTEL_ENABLED=false to disable tracing will break silently, since the
default flipped to true.

Please, Update the PR description to clearly state HYPERFLEET_LOGGING_OTEL_ENABLED is no
longer supported.

@rafabene
Copy link
Copy Markdown
Contributor

The JIRA ticket HYPERFLEET-762 lists "CloudEvent Tracing Integration" as one of the missing
items, and the tracing standard requires trace context propagation via CloudEvents
(traceparent/tracestate extension attributes). This PR doesn't address that.

Is this intentionally deferred to a follow-up ticket? If so, consider noting it in the PR
description and creating the follow-up ticket so it's tracked.


| Property | Environment Variable | Type | Default | Description |
|----------|---------------------|------|---------|-------------|
| `logging.otel.enabled` | `TRACING_ENABLED` | bool | `true` | Enable OpenTelemetry tracing (HyperFleet standard) |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to update to use the HYPERFLEET_ prefix here.

}
}

func TestInitTraceProvider_OTLPExporter(t *testing.T) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The http/protobuf exporter path (line 53 of otel.go) has no test coverage.
TestInitTraceProvider_OTLPExporter only tests the default gRPC case, and
TestInitTraceProvider_InvalidProtocol tests invalid values falling back to gRPC.

Consider adding a test case for http/protobuf:

func TestInitTraceProvider_OTLPHttpExporter(t *testing.T) {
      ctx := context.Background()
      t.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://fake-otel-collector:4318")
      t.Setenv("OTEL_EXPORTER_OTLP_PROTOCOL", "http/protobuf")

      tp, err := InitTraceProvider(ctx, "test-service", "v1.0.0")
      if err != nil {
              t.Fatalf("Failed to initialize trace provider with HTTP OTLP: %v", err)
      }
      defer func() {
              if err := Shutdown(ctx, tp); err != nil {
                      t.Errorf("Failed to shutdown: %v", err)
              }
      }()
}

// - OTEL_TRACES_SAMPLER: sampler type (default: "parentbased_traceidratio")
// - OTEL_TRACES_SAMPLER_ARG: sampling rate 0.0-1.0 (default: 1.0)
// - OTEL_RESOURCE_ATTRIBUTES: additional resource attributes (k=v,k2=v2 format)
func InitTraceProvider(ctx context.Context, serviceName, serviceVersion string) (*trace.TracerProvider, error) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InitTraceProvider is now ~80 lines with three distinct concerns: exporter creation, resource
setup, and sampler selection. Consider extracting helpers to reduce complexity:

func createExporter(ctx context.Context) (trace.SpanExporter, error) {
      // OTLP vs stdout logic
}

func selectSampler(ctx context.Context) trace.Sampler {
      // sampler type switch + parseSamplingRate
}

This would make InitTraceProvider a concise orchestrator (~30 lines) and make each piece
independently testable.

}
}

func TestInitTraceProvider_SamplerEnvironmentVariables(t *testing.T) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each test calls InitTraceProvider which sets otel.SetTracerProvider(tp) and
otel.SetTextMapPropagator(...) globally. The global state is never reset between subtests,
which could cause subtle issues if tests are reordered or parallelized.

Consider adding cleanup in each subtest:

defer func() {
      if err := Shutdown(ctx, tp); err != nil {
              t.Errorf("Failed to shutdown: %v", err)
      }
      otel.SetTracerProvider(nil)
      otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator())
}()

Or wrap the reset in a test helper that all subtests share.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants