Implemented the FlashAttention-2 algorithm (Tri Dao) for efficient transformer training, inspired by OpenAI's Fused Attention work. Focused on optimising attention computation and memory usage.
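The core trick behind this kind of kernel can be illustrated in a few lines. The following is a minimal pure-Python sketch of block-wise attention with an online softmax for a single query vector. It is not the project's GPU kernel; the function names and `block_size` parameter are hypothetical and chosen only for illustration.

```python
import math


def naive_attention(q, keys, values):
    """Reference: softmax(q.K^T / sqrt(d)) V, materialising all scores at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Same result, but keys/values are visited one block at a time.

    Only a running max, a running softmax denominator and a running
    weighted sum are kept, so the full score row is never stored.
    """
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new max (exp(-inf) is 0.0).
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [
            a * correction + sum(w * v[d] for w, v in zip(weights, v_blk))
            for d, a in enumerate(acc)
        ]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions should agree up to floating-point error for any block size; the tiled version simply trades one large intermediate for a small amount of rescaling per block.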
Designed and implemented a full CUDA GEMM optimization pipeline, reaching close to 92% of cuBLAS performance. Also analysed the vLLM paper on PagedAttention.
Full implementation of the Transformer architecture in PyTorch: embeddings, positional encodings, multi-head attention, encoder–decoder blocks and training loop. Built mainly to understand the architecture deeply, including attention visualisation and ablation.
Series of research-style assignments implementing the APL framework, experimenting with noise schedules, and using GNNs for multi-modal tasks. Good exposure to controlled experimentation and analysis.
Retrieval-augmented question-answering system over PDFs. Handles chunking, embedding, vector search and a simple LLM wrapper. Designed to keep responses grounded in the retrieved context.
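The chunk, embed and search steps can be sketched as follows. This is a toy illustration under stated assumptions: a bag-of-words counter stands in for a real embedding model, a sorted list stands in for a vector store, and all function names are hypothetical rather than taken from the project.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=8, overlap=2):
    """Split text into word windows of `size`, sharing `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text):
    """Toy embedding: lowercase token counts (assumption, not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    q_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)
    return ranked[:k]
```

In a grounded pipeline, the retrieved chunks would then be placed in the LLM prompt so the answer can be tied back to the source text.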
Training pipeline for ResNet-18 on CIFAR-10 using realistic tricks: data augmentation, cosine LR schedule, mixed precision, and gradient handling. Achieved ~90% accuracy with a clean, configuration-driven script.
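As an illustration of the cosine LR schedule mentioned above, here is a minimal sketch of the standard cosine annealing formula. The parameter names (`base_lr`, `min_lr`, `total_steps`) and default values are assumptions for the example, not the settings used in the actual script.

```python
import math


def cosine_lr(step, total_steps, base_lr=0.1, min_lr=0.0):
    """Anneal the learning rate from base_lr to min_lr along half a cosine."""
    step = min(max(step, 0), total_steps)  # clamp to the schedule range
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The rate starts at `base_lr`, passes through the midpoint halfway through training, and reaches `min_lr` at the final step.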
Small interactive game controlled by hand gestures, using MediaPipe for landmark detection and OpenCV for rendering. Optimised to run in real time on modest hardware, which made latency and pipeline issues very visible.
DIY electronics, including a self-watering plant system, sensor projects and 3D-printed hardware. These projects gave me intuition for signals, noise and real-world debugging that feeds into my AI hardware work.
Research-oriented implementation of the Mamba architecture, experimenting with selective state-space mechanisms and long-sequence modelling efficiency.
Fast Dynamic LoRA variants using PEFT, aimed at efficient fine-tuning and rapid switching across multiple tasks.
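The idea of switching low-rank adapters over a frozen base weight can be sketched in plain Python. This does not use or mirror the PEFT API; the `LoRALinear` class and its method names are hypothetical, and the update follows the usual LoRA form y = Wx + (alpha / r) * B(Ax).

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear layer with named low-rank adapters (illustrative only)."""

    def __init__(self, weight):
        self.weight = weight   # frozen base weight, shape (out, in)
        self.adapters = {}     # name -> (A, B, scale)
        self.active = None     # name of the adapter currently applied

    def add_adapter(self, name, a, b, alpha):
        # A has shape (r, in), B has shape (out, r); scale is alpha / r.
        rank = len(a)
        self.adapters[name] = (a, b, alpha / rank)

    def set_adapter(self, name):
        # Switching tasks only changes a pointer; the base weight is untouched.
        if name is not None and name not in self.adapters:
            raise KeyError(name)
        self.active = name

    def forward(self, x):
        y = matvec(self.weight, x)
        if self.active is None:
            return y
        a, b, scale = self.adapters[self.active]
        delta = matvec(b, matvec(a, x))
        return [yi + scale * di for yi, di in zip(y, delta)]
```

Because each adapter stores only the two small matrices, many tasks can share one base model and be swapped without reloading it.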
Investigating GPU-Direct paths, bandwidth limitations and how they affect end-to-end training throughput for ML workloads.
Duolingo-style financial literacy platform aimed at students, with gamified modules and simple content. Currently in exploration / concept stage.
Attempted integrating perception and planning for an automated Rubik’s cube solver. The project stalled, but it forced me to think about closed-loop control and robustness.
Prototype for benchmarking AI models across hardware platforms. It did not advance in the hackathon, but seeded later ideas for GPU benchmarking and optimisation work.
Early attempts at a SaaS platform for AI model benchmarking and auto-tuning on different devices. Parked until I gather more practical evidence from current research projects.
Several early experiments around study assistants, knowledge management and fast coding patterns. Most were intentionally short-lived, but provided signal on what’s worth pursuing.
Played with agentic frameworks and task routing between agents. Paused to deepen my foundations in core DL and GPU systems before revisiting.