Supporting Hydra Speculative Decoding on TensorRT-LLM Python Session
Introduction
I’ve previously studied a range of speculative decoding techniques and implemented several of them in PyTorch, covering the model architecture, training, and inference scripts (fast-llm-inference). This time, of course, I have a new goal: supporting Hydra speculative decoding in the TensorRT-LLM Python session.