ML Engineer · Founder
Building agentic systems, from reasoning loops to production backends.
Incoming MLE at Robinhood (Agentic AI) · M.S. CS (AI/ML) at Duke · B.Comp. CS at NUS
Currently interested in: agent evaluation harnesses, context engineering for long-running workflows, and what it actually takes to benchmark agents that make real-world decisions over time.
VYNN AI – Agentic financial analyst platform (sole engineer, ~500 users, 50K+ LOC)
LangGraph supervisor orchestrates 5 specialized agents for end-to-end equity research: data scraping → DCF modeling (6 sector strategies) → news intelligence → report generation, all in under 7 minutes. The hard part wasn't the LLM calls; it was making the numbers trustworthy. The recommendation engine uses a 3-layer architecture: deterministic math (RecommendationCalculator) → LLM narrative → regex-based validator that blocks publication if citation coverage drops below 95% (sketched below). Built a custom 1,293-line Excel formula evaluator so the DCF workbook and downstream JSON stay perfectly consistent without requiring Excel. Nightly CI runs a golden-dataset regression suite across 100 QQQ companies and blocks deployment if valuations drift beyond threshold.
→ stock-analyst (agent backend) · vynnai-web (platform frontend) · api-runner (API layer)
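To give the validation layer some flavor, here's a minimal sketch of a coverage gate. It assumes bracketed `[n]` citation markers and defines coverage as the fraction of sentences carrying at least one marker; the regexes, the sentence splitter, and names like `gate_publication` are illustrative, not VYNN's actual implementation.

```python
import re

# Assumed marker format: numbered brackets, e.g. "...revenue grew 12% [3]."
CITATION_RE = re.compile(r"\[(\d+)\]")

def citation_coverage(report: str) -> float:
    """Fraction of sentences that carry at least one [n] citation marker."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", report) if s.strip()]
    if not sentences:
        return 0.0
    cited = sum(1 for s in sentences if CITATION_RE.search(s))
    return cited / len(sentences)

def gate_publication(report: str, threshold: float = 0.95) -> str:
    """Block the LLM narrative from shipping when coverage drops below threshold."""
    coverage = citation_coverage(report)
    if coverage < threshold:
        raise ValueError(
            f"citation coverage {coverage:.1%} below {threshold:.0%}; blocking publication"
        )
    return report
```

The point of making this layer deterministic regex rather than another LLM call: the gate itself can't hallucinate a pass.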
AutoCodeRover – Autonomous code repair agent · Core technology acquired by Sonar
Designed the Self-Fix Agent: when a patch fails to apply, an LLM-as-a-Judge diagnoses which pipeline stage (Context Retrieval or Patch Generation) caused the failure, generates corrective feedback, and replays from that stage, preserving upstream state via UUID-targeted responses. Also built a stateful replay mechanism so developers can inject feedback on any intermediate LLM response and trigger selective re-execution downstream (sketched after the repo links). Result: 51.6% on SWE-bench Verified (up from 38.4%), 1.8× patch precision over the next-best open-source agent.
→ auto-code-rover (agent backend) · Jetbrains-IDE-Plugin (Kotlin, end-to-end)
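A minimal Python sketch of the stage-targeted replay idea: each stage's output is cached under a response UUID, so a rerun can start at the judged failure point while everything upstream is reused untouched. The two-stage list, the dict-shaped state, and `ReplayablePipeline` are hypothetical simplifications, not ACR's actual code.

```python
import uuid
from typing import Callable

STAGES = ["context_retrieval", "patch_generation"]  # simplified to the two judged stages

class ReplayablePipeline:
    """Caches each stage's output under a UUID so a rerun can start mid-pipeline."""

    def __init__(self, stage_fns: dict[str, Callable[[dict], dict]]):
        self.stage_fns = stage_fns
        self.cache: dict[str, tuple[str, dict]] = {}  # stage name -> (response UUID, output)

    def run(self, task: dict, from_stage: str = STAGES[0],
            feedback: str | None = None) -> dict:
        state = dict(task)
        start = STAGES.index(from_stage)
        for stage in STAGES[:start]:
            _, cached = self.cache[stage]       # upstream state preserved as-is
            state.update(cached)
        for stage in STAGES[start:]:
            if feedback is not None:
                state["feedback"] = feedback    # judge's corrective feedback, consumed once
                feedback = None
            out = self.stage_fns[stage](state)
            self.cache[stage] = (str(uuid.uuid4()), out)
            state.update(out)
        return state
```

On a judged failure you'd call something like `pipeline.run(task, from_stage="patch_generation", feedback=judge_feedback)`, and the cached retrieval output flows through unchanged.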
ACR JetBrains Plugin – IDE-integrated autonomous repair
Built end-to-end in Kotlin. Three things I'm most proud of: (1) GumTree 3-way AST merge: when you've edited code while the agent is patching the same file remotely, the plugin reconciles baseline → your edits → agent's patch at the AST level, not the text level. (2) PSI-based context enrichment: before sending a task to ACR, the plugin extracts symbol references, cursor history (last 10 positions), and open files to narrow the agent's search scope. (3) Embedded SonarLint: runs static analysis locally, then lets you one-click send any issue to ACR for autonomous fixing.
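Roughly the shape of the context payload from (2), sketched in Python for consistency with the other snippets (the real plugin is Kotlin against the IntelliJ PSI APIs). The field names and `to_payload` helper are guesses for illustration.

```python
from dataclasses import dataclass

@dataclass
class EditorContext:
    """Hypothetical shape of the context the plugin attaches to an ACR task."""
    symbol_refs: list[str]     # symbols referenced in the active file, per PSI
    cursor_history: list[str]  # "path:line" entries, newest last
    open_files: list[str]      # currently open editor tabs

    def to_payload(self, max_positions: int = 10) -> dict:
        # cap cursor history at the last 10 positions, matching the plugin's limit
        return {
            "symbol_refs": sorted(set(self.symbol_refs)),
            "cursor_history": self.cursor_history[-max_positions:],
            "open_files": self.open_files,
        }
```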
LUMINA – Multi-agent citation screening for medical systematic reviews (first author)
Four-agent pipeline: classifier triage β PICOS-guided Chain-of-Thought screening β LLM-as-a-Judge reviewer β self-correction agent. Evaluated across 15 SRMAs (~150K citations from BMJ, JAMA, Lancet). 98.2% sensitivity (10 of 15 at perfect 100%) with 35Γ fewer missed studies vs. prior baselines, at $0.007/article.
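The per-citation control flow, sketched in Python. The four agents are opaque callables here, and record keys like "triage" and "judge_verdict" are invented for illustration; the paper's actual interfaces differ.

```python
from typing import Callable

Agent = Callable[[dict], dict]  # each agent reads a citation record and annotates it

def screen_citation(record: dict, triage: Agent, screener: Agent,
                    judge: Agent, corrector: Agent) -> dict:
    record = triage(record)                  # cheap classifier: confident excludes exit early
    if record.get("triage") == "exclude":
        return record
    record = screener(record)                # PICOS-guided chain-of-thought include/exclude
    record = judge(record)                   # LLM-as-a-Judge audits the screener's reasoning
    if record.get("judge_verdict") == "disagree":
        record = corrector(record)           # self-correction agent re-screens with critique
    return record
```

The triage exit is what keeps the cost near $0.007/article: most citations never reach the expensive chain-of-thought call.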
- Agent harness design – VYNN's golden-dataset regression suite and ACR's SWE-bench eval loop taught me the same lesson: the harness that catches agent regressions matters more than the agent itself. I'm interested in building eval infrastructure that can score long-running, multi-step agents where "correct" isn't a single number.
- Context engineering – Most agent failures I've debugged trace back to what the agent didn't know, not what it reasoned poorly about. PSI-based enrichment in the ACR plugin, MCP self-retrieval in VYNN, 33 externalized prompt templates: these are all different bets on the same problem, giving agents the right context at the right time.
- The gap between demo benchmarks and production trust – An agent that scores 51.6% on SWE-bench still fails nearly half the time. VYNN's 3-layer recommendation validator exists because "usually right" isn't good enough for financial decisions. I'm drawn to the engineering that makes agents trustworthy enough to run unsupervised.
Last updated: Mar 2026


