Conversation
Pull request overview
This PR adds an SSL-failure recovery path to the LLM client by refreshing TEE registry metadata/TLS configuration and retrying the request once when SSL-related errors occur.
Changes:
- Refactors TEE/TLS initialization into refreshable helpers and introduces a one-time SSL retry mechanism for non-streaming requests.
- Adds SSL-aware retry behavior for streaming chat setup (retry only if the stream hasn’t yielded any chunks yet).
- Updates example scripts to request a larger default OPG approval amount.
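The stream-setup rule in the second bullet (retry only before the first chunk reaches the consumer) can be sketched as a standalone helper. The names below are hypothetical illustrations, not the PR's actual code:

```python
import asyncio
import ssl


async def stream_with_ssl_retry(make_stream, refresh):
    """Hypothetical helper: retry stream setup once on an SSL error,
    but only when no chunk has been yielded yet."""
    yielded = False
    try:
        async for chunk in make_stream():
            yielded = True
            yield chunk
    except ssl.SSLError:
        if yielded:
            # Chunks were already delivered; retrying would duplicate output.
            raise
        await refresh()  # stand-in for re-fetching TEE metadata / TLS config
        async for chunk in make_stream():
            yield chunk


# Demo: the first setup attempt fails with an SSL error, the retry succeeds.
state = {"attempts": 0, "refreshed": False}


async def fake_stream():
    state["attempts"] += 1
    if state["attempts"] == 1:
        raise ssl.SSLError("handshake failed")
    yield "chunk-1"
    yield "chunk-2"


async def fake_refresh():
    state["refreshed"] = True


async def demo():
    return [c async for c in stream_with_ssl_retry(fake_stream, fake_refresh)]


chunks = asyncio.run(demo())
```

The `yielded` flag is the whole guard: once anything has been emitted, a retry would replay chunks the caller already consumed, so the error must propagate instead.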
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/opengradient/client/llm.py | Adds TEE/TLS refresh + HTTP client rebuild and uses it to retry once on SSL-related failures for chat/completion (and stream setup). |
| examples/llm_chat_streaming.py | Increases example Permit2 approval amount. |
| examples/llm_chat.py | Increases example Permit2 approval amount. |
examples/llm_chat_streaming.py (Outdated)

```diff
 async def main():
     llm = og.LLM(private_key=os.environ.get("OG_PRIVATE_KEY"))
-    llm.ensure_opg_approval(opg_amount=0.1)
+    llm.ensure_opg_approval(opg_amount=1)
```
This example increases the default approval amount from 0.1 to 1 OPG, which may be an unexpected on-chain spend for someone running the example. Consider keeping the approval amount minimal (or explaining the rationale in a comment) to avoid surprising users.
examples/llm_chat.py (Outdated)

```diff
 async def main():
     llm = og.LLM(private_key=os.environ.get("OG_PRIVATE_KEY"))
-    llm.ensure_opg_approval(opg_amount=0.1)
+    llm.ensure_opg_approval(opg_amount=1)
```
This example increases the default approval amount from 0.1 to 1 OPG, which may be an unexpected on-chain spend for someone running the example. Consider keeping the approval amount minimal (or explaining the rationale in a comment) to avoid surprising users.
src/opengradient/client/llm.py (Outdated)

```python
self._tee_registry_address = tee_registry_address
self._llm_server_url = llm_server_url
self._reset_lock = threading.Lock()

self._refresh_tee_config()
self._init_x402_stack()
```
Using a blocking threading.Lock inside async methods can stall the event loop if contention occurs (e.g., multiple tasks hit the SSL-retry path). Consider using an asyncio-compatible lock (asyncio.Lock/anyio.Lock) or otherwise ensuring the critical section cannot block the loop.
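A minimal sketch of the asyncio-friendly alternative the comment suggests, using a hypothetical `TeeRefresher` stand-in rather than the real client class:

```python
import asyncio


class TeeRefresher:
    """Hypothetical sketch: serialize TEE refreshes with an asyncio.Lock so a
    contended refresh suspends its coroutine instead of blocking the event loop."""

    def __init__(self):
        self._reset_lock = asyncio.Lock()
        self.refresh_count = 0

    async def refresh_tee_and_reset(self):
        async with self._reset_lock:  # awaiting here yields control to the loop
            await asyncio.sleep(0.01)  # stand-in for registry fetch + client rebuild
            self.refresh_count += 1


async def main():
    refresher = TeeRefresher()
    # Two tasks hit the retry path at once; the lock serializes the refreshes
    # while other tasks on the loop keep running.
    await asyncio.gather(
        refresher.refresh_tee_and_reset(),
        refresher.refresh_tee_and_reset(),
    )
    return refresher.refresh_count


count = asyncio.run(main())
```

A further refinement (not shown) would be to have the second waiter skip its own refresh if one completed while it was waiting, so back-to-back SSL failures trigger only one registry fetch.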
src/opengradient/client/llm.py (Outdated)

```python
old_http_client = self._http_client
self._refresh_tee_config()
self._init_x402_stack()

try:
    await old_http_client.aclose()
except Exception:
    logger.debug("Failed to close previous HTTP client during TEE refresh.", exc_info=True)
```
_refresh_tee_and_reset() swaps out self._http_client and then closes the previous client. If other coroutines are concurrently using the old client (especially streaming), closing it here can terminate in-flight requests unexpectedly. Consider avoiding closing a shared client while it may be in use (e.g., serialize all requests during reset, use per-request clients, or defer cleanup until you can guarantee no active users).
Suggested change:

```diff
-old_http_client = self._http_client
-self._refresh_tee_config()
-self._init_x402_stack()
-try:
-    await old_http_client.aclose()
-except Exception:
-    logger.debug("Failed to close previous HTTP client during TEE refresh.", exc_info=True)
+self._refresh_tee_config()
+self._init_x402_stack()
```
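One way to avoid closing a client that is still in use is to snapshot the client per request and defer the close until in-flight requests drain. A hedged sketch with hypothetical names (the real client is an `x402HttpxClient`, stubbed here with a fake):

```python
import asyncio


class SwappableClient:
    """Hypothetical sketch: swap in a new client immediately, but defer closing
    the old one until its in-flight requests have finished."""

    def __init__(self, client):
        self._client = client
        self._inflight = 0
        self._idle = asyncio.Event()
        self._idle.set()

    def acquire(self):
        """Take a snapshot of the client for one request."""
        self._inflight += 1
        self._idle.clear()
        return self._client

    def release(self):
        """Mark one request as finished."""
        self._inflight -= 1
        if self._inflight == 0:
            self._idle.set()

    async def close_when_idle(self):
        await self._idle.wait()  # wait for in-flight requests to drain
        await self._client.aclose()


class FakeClient:
    def __init__(self, name):
        self.name = name
        self.closed = False

    async def aclose(self):
        self.closed = True


async def demo():
    old = SwappableClient(FakeClient("old"))
    client = old.acquire()          # a request is in flight on the old client
    new = SwappableClient(FakeClient("new"))  # refresh swaps in a new client
    closer = asyncio.create_task(old.close_when_idle())
    await asyncio.sleep(0)          # old client must survive while still in use
    still_open = not old._client.closed
    old.release()                   # request finishes; old client may now close
    await closer
    return still_open, old._client.closed, new._client.name


result = asyncio.run(demo())
```

Per-request clients, as the comment also suggests, sidestep the bookkeeping entirely at the cost of losing connection pooling across requests.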
src/opengradient/client/llm.py (Outdated)

```python
async def _call_with_ssl_retry(
    self,
    operation_name: str,
    call: Callable[[], Awaitable[T]],
) -> T:
    """Retry once with fresh TEE/TLS state when the failure is SSL-related."""
    try:
        return await call()
    except Exception as exc:
        if not self._has_ssl_cause(exc):
            raise
        logger.warning(
            "SSL failure during %s; refreshing TEE and retrying once: %s",
            operation_name,
            exc,
        )
        await self._refresh_tee_and_reset()
        return await call()
```
The new SSL-retry/reset behavior isn’t covered by unit tests. Adding tests that simulate an SSL-related exception on the first request and assert that the TEE config is refreshed and the call is retried (including the streaming path) would help prevent regressions.
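A sketch of the kind of test the comment asks for, using a minimal harness that mirrors the `_call_with_ssl_retry` logic shown above (the harness and its attribute names are hypothetical stand-ins for the real client):

```python
import asyncio
import ssl


class SslRetryHarness:
    """Minimal stand-in for the client so the retry contract can be
    asserted in isolation."""

    def __init__(self):
        self.refreshed = 0

    def _has_ssl_cause(self, exc):
        # Walk the exception chain looking for an SSL error.
        while exc is not None:
            if isinstance(exc, ssl.SSLError):
                return True
            exc = exc.__cause__ or exc.__context__
        return False

    async def _refresh_tee_and_reset(self):
        self.refreshed += 1

    async def _call_with_ssl_retry(self, operation_name, call):
        try:
            return await call()
        except Exception as exc:
            if not self._has_ssl_cause(exc):
                raise
            await self._refresh_tee_and_reset()
            return await call()


async def test_retries_once_on_ssl_failure():
    harness = SslRetryHarness()
    attempts = {"n": 0}

    async def flaky_call():
        attempts["n"] += 1
        if attempts["n"] == 1:
            raise ssl.SSLError("certificate verify failed")
        return "ok"

    result = await harness._call_with_ssl_retry("chat", flaky_call)
    return result, attempts["n"], harness.refreshed


outcome = asyncio.run(test_retries_once_on_ssl_failure())
```

Companion cases worth adding: a non-SSL exception must propagate without any refresh, and a second consecutive SSL failure must not trigger a second retry.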
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
src/opengradient/client/llm.py (Outdated)

```python
old_http_client = self._http_client
self._refresh_tee_config()
self._http_client = x402HttpxClient(self._x402_client, verify=self._tls_verify)

try:
    await old_http_client.aclose()
except Exception:
    logger.debug("Failed to close previous HTTP client during TEE refresh.", exc_info=True)
```
_refresh_tee_and_reset() swaps the shared HTTP client and then closes the previous client. If the same LLM instance is used concurrently, other in-flight requests may still be using the old client instance and can fail when it gets closed. Consider preventing refresh while requests are in flight (e.g., by coordinating refresh with request execution) or delaying closure until it’s safe.
Suggested change:

```diff
-old_http_client = self._http_client
-self._refresh_tee_config()
-self._http_client = x402HttpxClient(self._x402_client, verify=self._tls_verify)
-try:
-    await old_http_client.aclose()
-except Exception:
-    logger.debug("Failed to close previous HTTP client during TEE refresh.", exc_info=True)
+self._refresh_tee_config()
+self._http_client = x402HttpxClient(self._x402_client, verify=self._tls_verify)
```
```diff
 llm = og.LLM(private_key=os.environ.get("OG_PRIVATE_KEY"))
-llm.ensure_opg_approval(opg_amount=0.1)
+llm.ensure_opg_approval(opg_amount=1)
```
This example increases opg_amount from 0.1 to 1, which is unrelated to the PR’s stated goal (“ssl retry”) and materially changes the on-chain approval users will be prompted to grant. If this isn’t intentional for the SSL retry work, consider reverting or explaining the rationale in the PR description.
```diff
 llm = og.LLM(private_key=os.environ.get("OG_PRIVATE_KEY"))
-llm.ensure_opg_approval(opg_amount=0.1)
+llm.ensure_opg_approval(opg_amount=1)
```
This example increases opg_amount from 0.1 to 1, which is unrelated to the PR’s stated goal (“ssl retry”) and materially changes the on-chain approval users will be prompted to grant. If this isn’t intentional for the SSL retry work, consider reverting or explaining the rationale in the PR description.
```python
self._rpc_url = rpc_url
self._tee_registry_address = tee_registry_address
self._llm_server_url = llm_server_url
self._reset_lock = threading.Lock()
```
_refresh_tee_and_reset() uses a threading.Lock() inside async code. If two coroutines hit the refresh path concurrently, the second will block the event loop while waiting for the lock, potentially stalling unrelated tasks. Consider switching to an async-friendly lock (e.g., an asyncio.Lock) or using a non-blocking/lazily-created async lock so contention yields control back to the loop.
this is used when we refetch from tee registry
examples/llm_chat.py (Outdated)

```diff
 async def main():
     llm = og.LLM(private_key=os.environ.get("OG_PRIVATE_KEY"))
-    llm.ensure_opg_approval(opg_amount=0.1)
+    llm.ensure_opg_approval(opg_amount=1)
```
this will break most testers who can only get 0.1 OPG, do we need to bump it?