GitHub – bytedance/Dolphin: the official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL 2025.
If you’ve ever tried to extract clean, structured data from a messy PDF or a phone photo of a document, you know the feeling. You zoom in. You squint. You copy and paste. And somehow the table turns into soup.
That’s exactly the pain point behind Dolphin, an open-source project from ByteDance that quietly does something impressive. It parses document images: not just clean digital PDFs, but also photographed pages with mixed layouts, formulas, tables, code blocks, and all the odd combinations real documents throw at us.
The official repository (https://github.com/bytedance/Dolphin) introduces Dolphin‑v2, an upgraded version presented at ACL 2025. And yes, it’s a meaningful upgrade, not just a version bump.
At its core, Dolphin‑v2 uses a document‑type‑aware, two‑stage architecture. In human terms, it first tries to understand what kind of document it’s looking at, then decides how to parse it. Think of it like how you instinctively read a textbook differently than a restaurant menu or a printed research paper. Same eyes, different expectations.
What stood out to me is the idea of heterogeneous anchor prompting. Instead of forcing every page into one rigid structure, Dolphin uses flexible anchors that adapt to different layouts. That’s why it can handle everything from dense academic PDFs to casually photographed notes. And it does this with a lightweight, parallel setup, so efficiency isn’t sacrificed for accuracy.
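To make the "analyze first, then parse each element with a type-specific prompt" idea concrete, here is a minimal sketch. To be clear: none of these names, prompts, or data structures come from the Dolphin codebase; they are invented for illustration of the two-stage, anchor-per-element pattern described above.

```python
# Hypothetical sketch of a two-stage, anchor-based parsing flow.
# All names here are made up for illustration; they do not mirror
# the actual Dolphin API.
from dataclasses import dataclass

@dataclass
class Anchor:
    kind: str    # e.g. "paragraph", "table", "formula"
    bbox: tuple  # (x0, y0, x1, y1) region on the page

def analyze_layout(page):
    """Stage 1: decide what kinds of elements the page contains.
    Faked with a fixed layout here, purely for demonstration."""
    return [
        Anchor("paragraph", (0, 0, 100, 30)),
        Anchor("table", (0, 35, 100, 70)),
        Anchor("formula", (0, 75, 100, 90)),
    ]

# Heterogeneous prompts: each element type gets its own instruction,
# instead of one rigid prompt for the whole page.
PROMPTS = {
    "paragraph": "Read the text in this region.",
    "table": "Parse this region as an HTML table.",
    "formula": "Transcribe this region as LaTeX.",
}

def parse_page(page):
    """Stage 2: pair every anchor with its type-specific prompt.
    The element-level calls are independent of each other, which is
    what makes a parallel setup possible."""
    anchors = analyze_layout(page)
    return [(a.kind, PROMPTS[a.kind]) for a in anchors]

results = parse_page(page="dummy")
```

The key design point is the independence in stage 2: because each anchor carries everything its prompt needs, the per-element decoding can be batched or parallelized rather than run as one long sequential pass over the page.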
The repo is refreshingly practical. You can clone it, install dependencies, download pre‑trained models, or jump straight into inference with different parsing granularities. There’s even an open call for bad cases, which says a lot about the mindset behind the project. This isn’t “we’re done”. It’s “help us make it better”.
Looking ahead, tools like Dolphin‑v2 feel like quiet infrastructure. You don’t always notice them, but they make everything else possible. Better document understanding means better search, better automation, better accessibility. And honestly, fewer headaches the next time you’re staring at a stubborn PDF, wondering why it won’t behave.


