Skip to content

Refetch ssl from registry#204

Merged
adambalogh merged 9 commits intomainfrom
ani/ssl-changes
Mar 26, 2026
Merged

Refetch ssl from registry#204
adambalogh merged 9 commits intomainfrom
ani/ssl-changes

Conversation

@dixitaniket
Copy link
Collaborator

@dixitaniket dixitaniket commented Mar 25, 2026

  • bumps opg approval to 1 in examples
  • ssl refresh on errors (due to tee updates)

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an SSL-failure recovery path to the LLM client by refreshing TEE registry metadata/TLS configuration and retrying the request once when SSL-related errors occur.

Changes:

  • Refactors TEE/TLS initialization into refreshable helpers and introduces a one-time SSL retry mechanism for non-streaming requests.
  • Adds SSL-aware retry behavior for streaming chat setup (retry only if the stream hasn’t yielded any chunks yet).
  • Updates example scripts to request a larger default OPG approval amount.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
src/opengradient/client/llm.py Adds TEE/TLS refresh + HTTP client rebuild and uses it to retry once on SSL-related failures for chat/completion (and stream setup).
examples/llm_chat_streaming.py Increases example Permit2 approval amount.
examples/llm_chat.py Increases example Permit2 approval amount.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

async def main():
llm = og.LLM(private_key=os.environ.get("OG_PRIVATE_KEY"))
llm.ensure_opg_approval(opg_amount=0.1)
llm.ensure_opg_approval(opg_amount=1)
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example increases the default approval amount from 0.1 to 1 OPG, which may be an unexpected on-chain spend for someone running the example. Consider keeping the approval amount minimal (or explaining the rationale in a comment) to avoid surprising users.

Copilot uses AI. Check for mistakes.
async def main():
llm = og.LLM(private_key=os.environ.get("OG_PRIVATE_KEY"))
llm.ensure_opg_approval(opg_amount=0.1)
llm.ensure_opg_approval(opg_amount=1)
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example increases the default approval amount from 0.1 to 1 OPG, which may be an unexpected on-chain spend for someone running the example. Consider keeping the approval amount minimal (or explaining the rationale in a comment) to avoid surprising users.

Copilot uses AI. Check for mistakes.
Comment on lines +100 to +105
self._tee_registry_address = tee_registry_address
self._llm_server_url = llm_server_url
self._reset_lock = threading.Lock()

self._refresh_tee_config()
self._init_x402_stack()
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a blocking threading.Lock inside async methods can stall the event loop if contention occurs (e.g., multiple tasks hit the SSL-retry path). Consider using an asyncio-compatible lock (asyncio.Lock/anyio.Lock) or otherwise ensuring the critical section cannot block the loop.

Copilot uses AI. Check for mistakes.
Comment on lines +141 to +149
old_http_client = self._http_client
self._refresh_tee_config()
self._init_x402_stack()

try:
await old_http_client.aclose()
except Exception:
logger.debug("Failed to close previous HTTP client during TEE refresh.", exc_info=True)

Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_refresh_tee_and_reset() swaps out self._http_client and then closes the previous client. If other coroutines are concurrently using the old client (especially streaming), closing it here can terminate in-flight requests unexpectedly. Consider avoiding closing a shared client while it may be in use (e.g., serialize all requests during reset, use per-request clients, or defer cleanup until you can guarantee no active users).

Suggested change
old_http_client = self._http_client
self._refresh_tee_config()
self._init_x402_stack()
try:
await old_http_client.aclose()
except Exception:
logger.debug("Failed to close previous HTTP client during TEE refresh.", exc_info=True)
self._refresh_tee_config()
self._init_x402_stack()

Copilot uses AI. Check for mistakes.
Comment on lines +150 to +167
async def _call_with_ssl_retry(
self,
operation_name: str,
call: Callable[[], Awaitable[T]],
) -> T:
"""Retry once with fresh TEE/TLS state when the failure is SSL-related."""
try:
return await call()
except Exception as exc:
if not self._has_ssl_cause(exc):
raise
logger.warning(
"SSL failure during %s; refreshing TEE and retrying once: %s",
operation_name,
exc,
)
await self._refresh_tee_and_reset()
return await call()
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new SSL-retry/reset behavior isn’t covered by unit tests. Adding tests that simulate an SSL-related exception on the first request and assert that the TEE config is refreshed and the call is retried (including the streaming path) would help prevent regressions.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +147 to +155
old_http_client = self._http_client
self._refresh_tee_config()
self._http_client = x402HttpxClient(self._x402_client, verify=self._tls_verify)

try:
await old_http_client.aclose()
except Exception:
logger.debug("Failed to close previous HTTP client during TEE refresh.", exc_info=True)

Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_refresh_tee_and_reset() swaps the shared HTTP client and then closes the previous client. If the same LLM instance is used concurrently, other in-flight requests may still be using the old client instance and can fail when it gets closed. Consider preventing refresh while requests are in flight (e.g., by coordinating refresh with request execution) or delaying closure until it’s safe.

Suggested change
old_http_client = self._http_client
self._refresh_tee_config()
self._http_client = x402HttpxClient(self._x402_client, verify=self._tls_verify)
try:
await old_http_client.aclose()
except Exception:
logger.debug("Failed to close previous HTTP client during TEE refresh.", exc_info=True)
self._refresh_tee_config()
self._http_client = x402HttpxClient(self._x402_client, verify=self._tls_verify)

Copilot uses AI. Check for mistakes.
Comment on lines 12 to 14
llm = og.LLM(private_key=os.environ.get("OG_PRIVATE_KEY"))
llm.ensure_opg_approval(opg_amount=0.1)
llm.ensure_opg_approval(opg_amount=1)

Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example increases opg_amount from 0.1 to 1, which is unrelated to the PR’s stated goal (“ssl retry”) and materially changes the on-chain approval users will be prompted to grant. If this isn’t intentional for the SSL retry work, consider reverting or explaining the rationale in the PR description.

Copilot uses AI. Check for mistakes.
Comment on lines 8 to 10
llm = og.LLM(private_key=os.environ.get("OG_PRIVATE_KEY"))
llm.ensure_opg_approval(opg_amount=0.1)
llm.ensure_opg_approval(opg_amount=1)

Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example increases opg_amount from 0.1 to 1, which is unrelated to the PR’s stated goal (“ssl retry”) and materially changes the on-chain approval users will be prompted to grant. If this isn’t intentional for the SSL retry work, consider reverting or explaining the rationale in the PR description.

Copilot uses AI. Check for mistakes.
Comment on lines +99 to +103
self._rpc_url = rpc_url
self._tee_registry_address = tee_registry_address
self._llm_server_url = llm_server_url
self._reset_lock = threading.Lock()

Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_refresh_tee_and_reset() uses a threading.Lock() inside async code. If two coroutines hit the refresh path concurrently, the second will block the event loop while waiting for the lock, potentially stalling unrelated tasks. Consider switching to an async-friendly lock (e.g., an asyncio.Lock) or using a non-blocking/lazily-created async lock so contention yields control back to the loop.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is used when we refetch from tee registry

@dixitaniket dixitaniket changed the title [WIP] ssl retry Refetch ssl from registry Mar 25, 2026
async def main():
llm = og.LLM(private_key=os.environ.get("OG_PRIVATE_KEY"))
llm.ensure_opg_approval(opg_amount=0.1)
llm.ensure_opg_approval(opg_amount=1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will break most testers who can only get 0.1 OPG, do we need to bump it?

@adambalogh adambalogh merged commit 6ffa8e2 into main Mar 26, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants