FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR Evaluation
*Corresponding author: xueqing.peng2024@gmail.com
Recent progress in multimodal large language models (MLLMs) has substantially improved document understanding, yet strong OCR performance on surface metrics does not necessarily imply faithful preservation of decision-critical evidence. This limitation is especially consequential in financial documents, where small visual errors, such as a missing negative marker, a shifted decimal point, an incorrect unit scale, or a misaligned reporting date, can induce materially different interpretations.
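To make the stakes concrete, here is a minimal sketch (ours, not code from the paper; `parse_amount` and the scale table are illustrative) of how each of these visual slips silently changes a reported figure by a sign flip or orders of magnitude:

```python
# Illustrative sketch (not FinCriticalED code): why small visual errors
# in financial tables are decision-critical.
SCALE = {"thousands": 1_000, "millions": 1_000_000}

def parse_amount(cell: str, unit: str = "millions") -> float:
    """Parse a financial table cell; parentheses denote a negative value."""
    s = cell.strip().replace(",", "")
    negative = s.startswith("(") and s.endswith(")")
    return (-1 if negative else 1) * float(s.strip("()")) * SCALE[unit]

print(parse_amount("(1,250)"))               # -1250000000.0: a $1.25B loss
print(parse_amount("1,250"))                 #  1250000000.0: dropped "()" flips the sign
print(parse_amount("(1,250)", "thousands"))  #    -1250000.0: misread unit header, 1000x off
print(parse_amount("(12.50)"))               #   -12500000.0: shifted decimal, 100x off
```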
To study this gap, we introduce FinCriticalED (Financial Critical Error Detection), a fact-centric visual benchmark for evaluating OCR and vision-language systems through the lens of evidence fidelity in high-stakes document understanding. FinCriticalED contains 859 real-world financial document pages paired with ground-truth HTML and 9,481 expert-annotated facts spanning five financially critical field types: Numbers, Monetary Units, Temporal Data, Reporting Entities, and Financial Concepts.
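As a rough mental model of the annotation unit (the schema below is hypothetical, not the benchmark's actual release format), each fact ties a value on the page to its field type and surrounding context:

```python
# Hypothetical fact annotation; field names are illustrative only.
fact = {
    "page_id": "doc_0412_p03",   # one of the 859 document pages
    "field_type": "Numbers",     # one of the five critical field types
    "value": "(1,250)",          # the string as rendered on the page
    "context": {"unit": "millions", "period": "FY2024"},
}
```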
We further develop an evaluation suite, including critical-field-aware metrics and a context-aware protocol, to assess whether model outputs preserve financially critical facts beyond lexical similarity. We benchmark OCR pipelines, specialized OCR VLMs, open-source MLLMs, and proprietary MLLMs on FinCriticalED. Results show that conventional OCR metrics can substantially overestimate factual reliability, and that OCR-specialized systems may outperform much larger general-purpose MLLMs in preserving critical financial evidence under complex layouts. FinCriticalED provides a rigorous benchmark for trustworthy financial OCR and a broader testbed for high-stakes multimodal document understanding.
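As a rough sketch of what a fact-level score looks like (our assumption; the paper's context-aware protocol is more involved), FFA can be read as the percentage of annotated gold facts whose normalized values survive into the model output, aggregated per field type:

```python
# Simplified stand-in for FinCriticalED's context-aware matching;
# `gold`/`pred` map fact IDs to (field_type, normalized_value) pairs.
from collections import defaultdict

def fact_level_accuracy(gold: dict, pred: dict) -> dict:
    """Per-field-type and overall % of gold facts preserved in the output."""
    hits, totals = defaultdict(int), defaultdict(int)
    for fact_id, (ftype, value) in gold.items():
        totals[ftype] += 1
        if pred.get(fact_id, ("", None))[1] == value:
            hits[ftype] += 1
    scores = {f"{f}-FFA": 100 * hits[f] / totals[f] for f in totals}
    scores["FFA"] = 100 * sum(hits.values()) / sum(totals.values())
    return scores

gold = {"f1": ("Numbers", "-1250e6"), "f2": ("Temporal", "FY2024")}
pred = {"f1": ("Numbers", "1250e6"), "f2": ("Temporal", "FY2024")}
print(fact_level_accuracy(gold, pred))
# {'Numbers-FFA': 0.0, 'Temporal-FFA': 100.0, 'FFA': 50.0}
```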
| Model | Size | R1 | RL | E↓ | Gen. Rank | N-FFA | T-FFA | M-FFA | R-FFA | FC-FFA | FFA | Fact Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **OCR Pipelines** |  |  |  |  |  |  |  |  |  |  |  |  |
| MinerU2.5 | 1.2B | - | - | - | - | - | - | - | - | - | - | - |
| PP-OCRv5 | 0.07B | 97.54 | 96.55 | 3.10 | - | - | - | - | - | - | - | - |
| **Specialized OCR VLMs** |  |  |  |  |  |  |  |  |  |  |  |  |
| DeepSeek-OCR | 6B | - | - | - | - | - | - | - | - | - | - | - |
| DeepSeek-OCR-2 | 3B | - | - | - | - | - | - | - | - | - | - | - |
| GLM-OCR | 0.9B | - | - | - | - | - | - | - | - | - | - | - |
| **Open-source MLLMs** |  |  |  |  |  |  |  |  |  |  |  |  |
| Gemma-3n-E4B-it | 4B | 83.49 | 79.59 | 23.82 | - | - | - | - | - | - | - | - |
| Qwen3-VL-8B-Instruct | 8B | - | - | - | - | - | - | - | - | - | - | - |
| Llama-4-Maverick | 17B | 98.00 | 97.62 | 3.70 | - | - | - | - | - | - | - | - |
| Qwen3.5-397B-A17B | 397B | - | - | - | - | - | - | - | - | - | - | - |
| **Proprietary MLLMs** |  |  |  |  |  |  |  |  |  |  |  |  |
| GPT-4o | - | - | - | - | - | - | - | - | - | - | - | - |
| GPT-5 | - | - | - | - | - | - | - | - | - | - | - | - |
| Claude-Sonnet-4.6 | - | **98.84** | **98.73** | **1.69** | - | - | - | - | - | - | - | - |
| Gemini-2.5-Pro | - | - | - | - | - | - | - | - | - | - | - | - |
R1 = ROUGE-1, RL = ROUGE-L, E↓ = Edit Distance (lower is better); together with Gen. Rank these form the General (%) columns. FFA = Fact-level Financial Accuracy; N-, T-, M-, R-, and FC-FFA report FFA on Numbers, Temporal Data, Monetary Units, Reporting Entities, and Financial Concepts, respectively, and together with Fact Rank form the Fact-Level (%) columns. Best General (%) results in **bold**. – = results pending.
```bibtex
@misc{he2025fincriticaledvisualbenchmarkfinancial,
title={FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR Evaluation},
author={Yueru He and Xueqing Peng and Yupeng Cao and Yan Wang and Lingfei Qian and Haohang Li and Yi Han and Ruoyu Xiang and Mingquan Lin and Prayag Tiwari and Jimin Huang and Guojun Xiong and Sophia Ananiadou},
year={2025},
eprint={2511.14998},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.14998},
}
```