
Sergiu Nistor

AI research engineer focused on deep learning and scalable ML infrastructure. I'm drawn to the unsolved parts of the field, where the science is still being written and the engineering hasn't caught up, and to building the systems that make those ideas real.

Website

LinkedIn

Calendly

Substack


Projects

| Project | Description |
| --- | --- |
| yoro-full-pretraining | Novel LLM architecture (YORO) in which the reasoning block runs once at prefill and is reused across all token-generation steps: O(1) reasoning cost instead of O(T). Pretrained from scratch on 10B tokens with 8×H100s and DeepSpeed ZeRO. |
| yoro-finetuning | Fine-tuning stage for YORO: freeze the reasoning block and train lightweight adaptation/compensation/concatenation subnets via knowledge distillation with temperature-scaled soft labels. |
| llm-layer-prefetch | Pipelined layer-streaming system enabling full LLM inference at a fraction of the model's VRAM footprint; disk, CPU, and GPU transfers are overlapped in parallel. |
| cuda-kernel-verifier | Runtime correctness checker for custom CUDA/Triton kernels: decorator-based, outlier-biased sampling, zero impact on the training graph. |
| self-attention-cuda-kernel-comparison | Benchmarks of hand-written CUDA C, Numba, and Triton self-attention kernels against PyTorch SDPA across sequence lengths, batch sizes, and head dimensions. |
| mojo-tensor | GPU-accelerated deep learning framework in Mojo: tensors, autograd, and neural-network layers with custom GPU kernel implementations. |
| distributed-llama.cpp | Distributed LLM inference across machines: routes OpenAI-compatible requests to llama.cpp nodes with automatic model distribution, load balancing, and mutual TLS. |
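The core YORO idea (run the reasoning block once at prefill, reuse its cached output for every decode step) can be sketched as follows. This is a minimal illustration with made-up names and NumPy stand-ins for the actual networks, not the repo's API:

```python
import numpy as np

# Hypothetical sketch of the YORO control flow: the expensive "reasoning
# block" runs once at prefill; every decode step reuses its cached output,
# so reasoning cost is O(1) in generated tokens instead of O(T).

rng = np.random.default_rng(0)
D = 16  # hidden size

W_reason = rng.standard_normal((D, D))  # stand-in for the heavy reasoning block
W_light = rng.standard_normal((D, D))   # stand-in for a lightweight per-token subnet

def reasoning_block(h):
    # Expensive computation, executed once per prompt.
    return np.tanh(h @ W_reason)

def decode_step(tok_emb, cached_reasoning):
    # Cheap per-token path: only the lightweight subnet runs,
    # conditioned on the cached reasoning output.
    return np.tanh(tok_emb @ W_light + cached_reasoning)

prompt_state = rng.standard_normal(D)
cache = reasoning_block(prompt_state)  # prefill: run once, cache

outputs = [decode_step(rng.standard_normal(D), cache) for _ in range(8)]
# 8 tokens generated with a single reasoning-block invocation.
```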
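The distillation objective mentioned for yoro-finetuning can be sketched as a standard Hinton-style temperature-scaled KD loss; the repo's exact loss may differ:

```python
import numpy as np

# Illustrative temperature-scaled knowledge distillation: soften teacher
# and student logits with temperature T, then take the cross-entropy of
# the student's softened distribution against the teacher's soft labels.

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    p_teacher = softmax(teacher_logits / T)            # soft labels
    log_p_student = np.log(softmax(student_logits / T))
    # T^2 rescales gradients back to the hard-label loss scale.
    return -(T * T) * np.mean(np.sum(p_teacher * log_p_student, axis=-1))

student = np.array([[2.0, 0.5, -1.0]])
teacher = np.array([[2.2, 0.4, -0.9]])
loss = kd_loss(student, teacher)
```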
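The pipelining behind llm-layer-prefetch can be illustrated with a toy two-stage producer-consumer pipeline; this is a simplified sketch (the real system also uses pinned host memory and asynchronous GPU copies):

```python
import queue
import threading
import time

# Illustrative pipelined layer prefetch: a "disk" stage and a "transfer"
# stage run in background threads, so loading layer N+1 overlaps with
# computing layer N. Bounded queues cap how far each stage runs ahead.

NUM_LAYERS = 6

def disk_stage(out_q):
    for i in range(NUM_LAYERS):
        time.sleep(0.01)           # simulate a disk read of layer weights
        out_q.put(("layer", i))
    out_q.put(None)                # sentinel: no more layers

def transfer_stage(in_q, out_q):
    while (item := in_q.get()) is not None:
        time.sleep(0.01)           # simulate a host-to-GPU copy
        out_q.put(item)
    out_q.put(None)

disk_q, gpu_q = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
threading.Thread(target=disk_stage, args=(disk_q,), daemon=True).start()
threading.Thread(target=transfer_stage, args=(disk_q, gpu_q), daemon=True).start()

computed = []
while (layer := gpu_q.get()) is not None:
    computed.append(layer[1])      # "run" the layer while later ones stream in
# All layers execute in order, with loads overlapped with compute.
```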
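The decorator-based verification idea behind cuda-kernel-verifier can be sketched like this. The names and signature here are hypothetical, not the library's actual API: a wrapped kernel is checked against a reference implementation on a sampled fraction of calls, and its result is passed through unchanged:

```python
import functools
import random

# Hypothetical sketch of decorator-based runtime kernel verification:
# on a sampled fraction of calls, compare the kernel's output against a
# trusted reference without altering the returned value (and hence
# without touching the training graph).

def verify_against(reference, sample_rate=0.1, tol=1e-5):
    def decorator(kernel):
        @functools.wraps(kernel)
        def wrapper(*args, **kwargs):
            out = kernel(*args, **kwargs)
            if random.random() < sample_rate:
                expected = reference(*args, **kwargs)
                assert abs(out - expected) <= tol, "kernel output diverged"
            return out  # result passed through unchanged
        return wrapper
    return decorator

def reference_sum(xs):
    return sum(xs)

@verify_against(reference_sum, sample_rate=1.0)  # check every call here
def fast_sum(xs):
    # stands in for a custom CUDA/Triton kernel
    total = 0.0
    for x in xs:
        total += x
    return total

result = fast_sum([1.0, 2.0, 3.0])
```

An outlier-biased variant, as the project description suggests, would sample verification more often for inputs or outputs far from recent statistics rather than uniformly at random.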

Skills

Languages · Python · C · C++ · Go · Mojo · Java · JavaScript

DL / ML · DeepSpeed · PyTorch · TensorFlow · Keras · CUDA · scikit-learn

Training · Pretraining · Distributed Training · Fine-Tuning · PEFT · SFT · RLHF · RLAIF · DPO · GRPO · LoRA · QLoRA · Unsloth

GenAI · Transformers · Diffusers · vLLM · LangChain · LangGraph · LlamaIndex · llama.cpp

Infrastructure · Docker · Kubernetes · AWS · GCP · Azure · Terraform
