Show HN: Llama 3.1 70B on a single RTX 3090 via NVMe-to-GPU bypassing the CPU
created: Feb. 21, 2026, 8:57 p.m. | updated: Feb. 22, 2026, 7:23 a.m.
Runs Llama 70B on a single RTX 3090 (24 GB VRAM) by streaming model layers through GPU memory over PCIe, with optional NVMe direct I/O that bypasses the CPU entirely.
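The layer-streaming idea above can be sketched as a double-buffered pipeline. This is a minimal Python sketch, not the project's code: `stream_forward`, `run_layer`, and `load_layer` are illustrative names, the 36/80 split comes from the numbers quoted in this post, and real overlap would use asynchronous CUDA copies from pinned or GPU-direct buffers rather than synchronous calls.

```python
# Minimal sketch of tiered layer streaming (illustrative, not the project's API).
# Tier A: the first VRAM_LAYERS blocks stay resident in GPU memory.
# Tier B: remaining blocks are fetched from NVMe; the next block's load is
# issued before the current one runs, so transfer can overlap compute.
VRAM_LAYERS = 36   # blocks resident in 24 GB VRAM at Q4_K_M (from the post)
TOTAL_LAYERS = 80  # transformer blocks in Llama 70B

def stream_forward(x, run_layer, load_layer):
    """run_layer(i, x, weights) -> x; load_layer(i) -> weights (async in reality)."""
    pending = None
    for i in range(TOTAL_LAYERS):
        if i < VRAM_LAYERS:
            x = run_layer(i, x, None)          # tier A: weights already on GPU
        else:
            weights = pending if pending is not None else load_layer(i)
            # Prefetch the next tier-B block while this one computes.
            pending = load_layer(i + 1) if i + 1 < TOTAL_LAYERS else None
            x = run_layer(i, x, weights)
    return x
```

Each tier-B block is loaded exactly once, one step ahead of when it is needed, which is the property that lets PCIe transfers hide behind GPU compute.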
Achieves an 83x speedup over the mmap baseline for 70B on consumer hardware (RTX 3090 + 48 GB RAM).
Q4_K_M fits 10 more layers in VRAM (36 vs 26), reducing tier B transfers.
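A back-of-envelope check of why a smaller quant leaves more resident layers. All numbers here are my assumptions, not the project's accounting: ~4.85 and ~6.56 bits per weight are typical averages for Q4_K_M and Q6_K GGUF quants, and the 4 GB reserve for KV cache and activations is a guess.

```python
# Rough capacity math (assumed figures, not the project's): Llama 70B has
# 80 transformer blocks; per-block bytes scale with bits-per-weight, so
# fewer bits means more blocks fit in the VRAM left after KV cache etc.
PARAMS_PER_BLOCK = 70e9 / 80  # crude: weights split evenly across blocks

def resident_layers(bits_per_weight, vram_gb=24, reserved_gb=4):
    bytes_per_block = PARAMS_PER_BLOCK * bits_per_weight / 8
    return int((vram_gb - reserved_gb) * 1e9 // bytes_per_block)
```

With these assumptions, `resident_layers(4.85)` lands in the high 30s and a ~6.5-bpw quant in the high 20s, the same ballpark as the 36-vs-26 split quoted above.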
Caveats and setup:
- Don't run on multi-tenant/server systems.
- Remove amd_iommu=off from /etc/default/grub, run update-grub, and reboot.
- Patches the NVIDIA DKMS module (os-mlock.c), because follow_pfn() was removed in kernel 6.12+.
- The NVMe disappears from /dev/ while bound to VFIO; picking the wrong device is high-risk, so never run this on your boot drive.
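The VFIO handoff that makes the NVMe vanish from /dev/ can be planned before anything is written. This is a hypothetical dry-run helper: the function name and example values are mine, and the unbind-then-new_id sysfs sequence is the generic way devices are handed to vfio-pci, not necessarily this project's exact script.

```python
# Hypothetical helper: the sysfs writes needed to detach an NVMe from its
# driver and hand it to vfio-pci, returned as (path, value) pairs so the
# sequence can be inspected (dry-run) before anything is echoed.
def vfio_bind_plan(pci_addr, vendor_device):
    dev = f"/sys/bus/pci/devices/{pci_addr}"
    return [
        (f"{dev}/driver/unbind", pci_addr),                       # detach from nvme driver
        ("/sys/bus/pci/drivers/vfio-pci/new_id", vendor_device),  # let vfio-pci claim this ID
    ]
```

Reviewing the planned writes before echoing them is a cheap guard against grabbing the wrong device, which is exactly the boot-drive risk called out above.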
Source: Hacker News