However Briefly

Don't bother parsing: Just use images for RAG

Adityav369

created: July 21, 2025, 5:16 p.m. | updated: July 22, 2025, 3:16 p.m.

Vision Language Models have quietly gotten good enough to understand documents directly, without a separate parsing step. Document images are embedded, and these embeddings are then refined by a language model (PaliGemma-3B) that has been trained to understand document structure. On the ViDoRe benchmark, the first evaluation specifically designed for visual document retrieval, our approach achieves 81.3% nDCG@5 compared to 67.0% for traditional parsing methods. With visual document retrieval, you just send us your documents (PDFs, images, even photos of whiteboards) and search with natural language. Visual document understanding solves the foundational problem, but real-world document workflows demand much more than single-shot retrieval.
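
The nDCG@5 figures quoted above measure how well the top five retrieved results are ordered for each query. As a rough illustration of the metric only, here is a minimal sketch assuming binary relevance labels; the function names and the sample ranking are hypothetical and are not taken from the ViDoRe evaluation code or from the article.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results.

    relevances[i] is the relevance label of the result at rank i + 1.
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=5):
    """nDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking.

    Assumes `relevances` covers every judged document for the query, so that
    sorting it in descending order gives the ideal ranking.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical query: the single relevant page comes back at rank 2.
ranked_relevance = [0, 1, 0, 0, 0]
score = ndcg_at_k(ranked_relevance, k=5)
print(f"nDCG@5 = {score:.4f}")
```

A benchmark score such as 81.3% is the average of this per-query value over all queries in the benchmark. A relevant page at rank 1 scores 1.0 for that query, and the score falls off logarithmically the further down the list it appears.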

2 weeks ago: Hacker News: Front Page