Tokasaurus: An LLM inference engine for high-throughput workloads

created: June 5, 2025, 9:27 p.m. | updated: June 6, 2025, 6:12 p.m.

TL;DR

We're releasing Tokasaurus, a new LLM inference engine optimized for throughput-intensive workloads. With small models, Tokasaurus benefits from very low CPU overhead and dynamic Hydragen grouping to exploit shared prefixes. For larger models, Tokasaurus supports async tensor parallelism for GPUs with NVLink and a fast implementation of pipeline parallelism for GPUs without. Open-source inference engines (i.e.

Optimizing Small Models

To benchmark Tokasaurus with small models, we'll use two workloads:

- Completing chatbot prompts from the ShareGPT dataset (a common benchmark for testing inference engines).
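The shared-prefix idea behind Hydragen can be illustrated with plain attention arithmetic: attention over a shared prefix and a per-sequence suffix can be computed separately and then merged with a log-sum-exp rescaling, so the prefix part needs to be computed only once for many sequences. Below is a minimal NumPy sketch of that merge rule, not the Tokasaurus implementation; the function names are illustrative:

```python
import numpy as np

def attention(q, k, v):
    """Softmax attention; also returns the log-sum-exp of the scores,
    which is what makes partial results mergeable."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    m = scores.max(axis=-1, keepdims=True)
    p = np.exp(scores - m)
    s = p.sum(axis=-1, keepdims=True)
    out = (p / s) @ v
    lse = (m + np.log(s)).squeeze(-1)  # log-sum-exp per query
    return out, lse

def merge(out_a, lse_a, out_b, lse_b):
    """Combine attention over two disjoint key/value chunks
    (e.g. shared prefix vs. per-sequence suffix) via LSE rescaling."""
    m = np.maximum(lse_a, lse_b)
    w_a = np.exp(lse_a - m)[..., None]
    w_b = np.exp(lse_b - m)[..., None]
    return (w_a * out_a + w_b * out_b) / (w_a + w_b)
```

Merging the prefix and suffix results this way is exactly equivalent to running attention over the concatenated key/value sequence, which is what lets the prefix computation be batched across all sequences that share it.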
