FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR Evaluation
*Corresponding author: xueqing.peng2024@gmail.com
Recent progress in multimodal large language models (MLLMs) has substantially improved document understanding, yet strong OCR performance on surface metrics does not necessarily imply faithful preservation of decision-critical evidence. This limitation is especially consequential in financial documents, where small visual errors, such as a missing negative marker, a shifted decimal point, an incorrect unit scale, or a misaligned reporting date, can induce materially different interpretations.
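To make the stakes concrete, here is a minimal sketch (ours, not code from the paper; `parse_amount` and the scale table are illustrative) of how each of these visual slips silently changes a reported figure by a sign flip or orders of magnitude:

```python
# Illustrative sketch (not FinCriticalED code): why small visual errors
# in financial tables are decision-critical.
SCALE = {"thousands": 1_000, "millions": 1_000_000}

def parse_amount(cell: str, unit: str = "millions") -> float:
    """Parse a financial table cell; parentheses denote a negative value."""
    s = cell.strip().replace(",", "")
    negative = s.startswith("(") and s.endswith(")")
    return (-1 if negative else 1) * float(s.strip("()")) * SCALE[unit]

print(parse_amount("(1,250)"))               # -1250000000.0: a $1.25B loss
print(parse_amount("1,250"))                 #  1250000000.0: dropped "()" flips the sign
print(parse_amount("(1,250)", "thousands"))  #    -1250000.0: misread unit header, 1000x off
print(parse_amount("(12.50)"))               #   -12500000.0: shifted decimal, 100x off
```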
To study this gap, we introduce FinCriticalED (Financial Critical Error Detection), a fact-centric visual benchmark for evaluating OCR and vision-language systems through the lens of evidence fidelity in high-stakes document understanding. FinCriticalED contains 859 real-world financial document pages paired with ground-truth HTML and 9,481 expert-annotated facts spanning five financially critical field types: Numbers, Monetary Units, Temporal Data, Reporting Entities, and Financial Concepts.
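As a rough mental model of the annotation unit (the schema below is hypothetical, not the benchmark's actual release format), each fact ties a value on the page to its field type and surrounding context:

```python
# Hypothetical fact annotation; field names are illustrative only.
fact = {
    "page_id": "doc_0412_p03",   # one of the 859 document pages
    "field_type": "Numbers",     # one of the five critical field types
    "value": "(1,250)",          # the string as rendered on the page
    "context": {"unit": "millions", "period": "FY2024"},
}
```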
We further develop an evaluation suite, including critical-field-aware metrics and a context-aware protocol, to assess whether model outputs preserve financially critical facts beyond lexical similarity. We benchmark OCR pipelines, specialized OCR VLMs, open-source MLLMs, and proprietary MLLMs on FinCriticalED. Results show that conventional OCR metrics can substantially overestimate factual reliability, and that OCR-specialized systems may outperform much larger general-purpose MLLMs in preserving critical financial evidence under complex layouts. FinCriticalED provides a rigorous benchmark for trustworthy financial OCR and a broader testbed for high-stakes multimodal document understanding.
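As a rough sketch of what a fact-level score looks like (our assumption; the paper's context-aware protocol is more involved), FFA can be read as the percentage of annotated gold facts whose normalized values survive into the model output, aggregated per field type:

```python
# Simplified stand-in for FinCriticalED's context-aware matching;
# `gold`/`pred` map fact IDs to (field_type, normalized_value) pairs.
from collections import defaultdict

def fact_level_accuracy(gold: dict, pred: dict) -> dict:
    """Per-field-type and overall % of gold facts preserved in the output."""
    hits, totals = defaultdict(int), defaultdict(int)
    for fact_id, (ftype, value) in gold.items():
        totals[ftype] += 1
        if pred.get(fact_id, ("", None))[1] == value:
            hits[ftype] += 1
    scores = {f"{f}-FFA": 100 * hits[f] / totals[f] for f in totals}
    scores["FFA"] = 100 * sum(hits.values()) / sum(totals.values())
    return scores

gold = {"f1": ("Numbers", "-1250e6"), "f2": ("Temporal", "FY2024")}
pred = {"f1": ("Numbers", "1250e6"), "f2": ("Temporal", "FY2024")}
print(fact_level_accuracy(gold, pred))
# {'Numbers-FFA': 0.0, 'Temporal-FFA': 100.0, 'FFA': 50.0}
```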
| Model | Size | R1 | RL | E↓ | Gen. Rank | N-FFA | T-FFA | M-FFA | R-FFA | FC-FFA | FFA | Fact Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **OCR Pipelines** |  |  |  |  |  |  |  |  |  |  |  |  |
| MinerU2.5 | 1.2B | - | - | - | - | - | - | - | - | - | - | - |
| PP-OCRv5 | 0.07B | 97.54 | 96.55 | 3.10 | - | - | - | - | - | - | - | - |
| **Specialized OCR VLMs** |  |  |  |  |  |  |  |  |  |  |  |  |
| DeepSeek-OCR | 6B | - | - | - | - | - | - | - | - | - | - | - |
| DeepSeek-OCR-2 | 3B | - | - | - | - | - | - | - | - | - | - | - |
| GLM-OCR | 0.9B | - | - | - | - | - | - | - | - | - | - | - |
| **Open-source MLLMs** |  |  |  |  |  |  |  |  |  |  |  |  |
| Gemma-3n-E4B-it | 4B | 83.49 | 79.59 | 23.82 | - | - | - | - | - | - | - | - |
| Qwen3-VL-8B-Instruct | 8B | - | - | - | - | - | - | - | - | - | - | - |
| Llama-4-Maverick | 17B | 98.00 | 97.62 | 3.70 | - | - | - | - | - | - | - | - |
| Qwen3.5-397B-A17B | 397B | - | - | - | - | - | - | - | - | - | - | - |
| **Proprietary MLLMs** |  |  |  |  |  |  |  |  |  |  |  |  |
| GPT-4o | - | - | - | - | - | - | - | - | - | - | - | - |
| GPT-5 | - | - | - | - | - | - | - | - | - | - | - | - |
| Claude-Sonnet-4.6 | - | **98.84** | **98.73** | **1.69** | - | - | - | - | - | - | - | - |
| Gemini-2.5-Pro | - | - | - | - | - | - | - | - | - | - | - | - |
R1 = ROUGE-1, RL = ROUGE-L, E↓ = Edit Distance (lower is better); together with Gen. Rank these form the General (%) columns. FFA = Fact-level Financial Accuracy; N-, T-, M-, R-, and FC-FFA report FFA on Numbers, Temporal Data, Monetary Units, Reporting Entities, and Financial Concepts, respectively, and together with Fact Rank form the Fact-Level (%) columns. Best General (%) results in **bold**. – = results pending.
```bibtex
@misc{he2025fincriticaledvisualbenchmarkfinancial,
title={FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR Evaluation},
author={Yueru He and Xueqing Peng and Yupeng Cao and Yan Wang and Lingfei Qian and Haohang Li and Yi Han and Ruoyu Xiang and Mingquan Lin and Prayag Tiwari and Jimin Huang and Guojun Xiong and Sophia Ananiadou},
year={2025},
eprint={2511.14998},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.14998},
}
```