FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR

Yueru He1, Xueqing Peng*3, Yupeng Cao2, Yan Wang3, Lingfei Qian3, Haohang Li2, Yi Han4, Shuyao Wang3, Ruoyu Xiang5, Fan Zhang6, Zhuohan Xie7, Mingquan Lin8, Prayag Tiwari9, Jimin Huang3, Guojun Xiong10, Sophia Ananiadou11

1Columbia University, USA    2Stevens Institute of Technology, USA    3The Fin AI, USA    4Georgia Institute of Technology, USA    5New York University, USA    6The University of Tokyo & MBZUAI, Japan & UAE    7MBZUAI, UAE    8University of Minnesota, USA    9Halmstad University, Sweden    10Harvard University, USA    11University of Manchester, UK

*Corresponding author: xueqing.peng2024@gmail.com

Figure (teaser): Overview of FinCriticalED. Left: Standard OCR metrics fail to capture financially critical semantic errors. Middle-left: Construction of a fact-centric dataset with five types of annotated financial facts. Middle-right: Evaluation pipeline applying a Deterministic-Rule-Guided LLM-as-Judge to OCR/VLM/MLLM outputs. Right: Key insights from experiments on 13 OCR systems and MLLMs.

Abstract

Recent progress in multimodal large language models (MLLMs) has substantially improved document understanding, yet strong optical character recognition (OCR) performance on surface metrics does not guarantee faithful preservation of decision-critical evidence. This limitation is especially consequential in financial documents, where small visual errors can induce discrete shifts in meaning. To study this gap, we introduce FinCriticalED (Financial Critical Error Detection), a fact-centric visual benchmark for evaluating whether OCR and vision-language systems preserve financially critical evidence beyond lexical similarity.

FinCriticalED contains 859 real-world financial document pages with 9,481 expert-annotated facts spanning five critical field types: numeric, temporal, monetary unit, reporting entity, and financial concept. We formulate the task as structured OCR with fact-level verification, and develop a Deterministic-Rule-Guided LLM-as-Judge protocol to assess whether model outputs preserve annotated facts in context. We benchmark 13 systems spanning OCR pipelines, specialized OCR VLMs, open-source MLLMs, and proprietary MLLMs.

Results reveal a clear gap between lexical accuracy and factual reliability: numerical values and monetary units emerge as the most vulnerable fact types, and critical errors concentrate in visually complex, mixed-layout documents, with distinct failure patterns across model families. Overall, FinCriticalED provides a rigorous benchmark for trustworthy financial OCR and a practical testbed for evidence fidelity in high-stakes multimodal document understanding.
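To make the judging protocol concrete, here is a minimal sketch of how deterministic-rule-guided judging might be wired. The normalizers and the `ask_llm_judge` helper are illustrative assumptions, not the released evaluation code: clear-cut fact types (numeric, monetary unit) are settled by normalization rules, and only context-dependent facts are escalated to the LLM judge.

```python
import re

def normalize_numeric(s: str) -> str:
    """Strip thousands separators and whitespace so '1,234.5' matches '1234.5'."""
    return re.sub(r"[,\s]", "", s)

def normalize_monetary(s: str) -> str:
    """Canonicalize spacing and case in monetary-unit cues, e.g. '$ in Millions'."""
    return re.sub(r"\s+", " ", s).strip().lower()

def judge_fact(fact_type: str, gold: str, extracted: str) -> str:
    """Deterministic rules first; defer ambiguous fact types to an LLM judge."""
    if fact_type == "numeric":
        return "correct" if normalize_numeric(gold) == normalize_numeric(extracted) else "critical_error"
    if fact_type == "monetary_unit":
        return "correct" if normalize_monetary(gold) == normalize_monetary(extracted) else "critical_error"
    # Temporal, reporting-entity, and financial-concept facts often need context.
    return ask_llm_judge(fact_type, gold, extracted)  # hypothetical LLM call

def ask_llm_judge(fact_type: str, gold: str, extracted: str) -> str:
    """Placeholder for the LLM-as-Judge step (prompt plus constrained verdict)."""
    raise NotImplementedError
```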

Dataset

- 859 document pages
- 9,481 annotated facts
- 5 critical field types (numeric, temporal, monetary unit, reporting entity, financial concept)
- 5 document types
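For concreteness, an annotated fact can be represented as a small record like the one below. This is an assumed schema for illustration only; the released dataset's field names may differ.

```python
# Hypothetical record layout for one annotated fact (not the official schema).
fact = {
    "page_id": "doc_0421_p3",          # which of the 859 document pages
    "fact_type": "monetary_unit",      # one of: numeric, temporal, monetary_unit,
                                       # reporting_entity, financial_concept
    "gold_value": "$ in millions",     # expert-annotated ground truth
    "context": "Revenue ($ in millions) for the year ended ...",
}
```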

Inter-Annotator Agreement

- Overall Fleiss' κ: 0.8837
- Pairwise Cohen's κ range: 0.82–0.93
- Independent annotators: 4
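Both agreement statistics can be reproduced with standard libraries; below is a minimal sketch on toy labels (the ratings are illustrative, not the benchmark's annotations).

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy example: 6 items rated by 4 annotators with labels from {0, 1, 2}.
ratings = np.array([
    [0, 0, 0, 0],
    [1, 1, 1, 2],
    [2, 2, 2, 2],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [2, 2, 0, 2],
])

# Overall Fleiss' kappa across all four annotators.
table, _ = aggregate_raters(ratings)
print("Fleiss' kappa:", fleiss_kappa(table, method="fleiss"))

# Pairwise Cohen's kappa, e.g. between annotators 0 and 1.
print("Cohen's kappa (A0 vs A1):", cohen_kappa_score(ratings[:, 0], ratings[:, 1]))
```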

Annotation Interface (Label Studio)

Annotators used Label Studio to highlight and label financial critical fields directly on rendered ground-truth HTML pages.

Results

| Model | Size | R1 | RL | E↓ | N-FFA | T-FFA | M-FFA | R-FFA | FC-FFA | FFA |
|---|---|---|---|---|---|---|---|---|---|---|
| OCR Pipelines |  |  |  |  |  |  |  |  |  |  |
| MinerU2.5 | 1.2B | 95.71 | 95.30 | 6.02 | 98.76 | 96.48 | 54.05 | 91.09 | 96.44 | 94.64 |
| PP-OCRv5 | 0.07B | 97.54 | 96.55 | 3.10 | 95.70 | 90.29 | 90.00 | 86.62 | 93.75 | 91.91 |
| Specialized OCR VLMs |  |  |  |  |  |  |  |  |  |  |
| DeepSeekOCR | 3B | 94.73 | 94.42 | 7.33 | 93.47 | 91.96 | 83.53 | 92.27 | 94.36 | 92.67 |
| DeepSeekOCR-2 | 6B | 92.90 | 92.18 | 10.72 | 82.63 | 91.90 | 82.83 | 88.69 | 86.51 | 86.19 |
| GLM-OCR | 0.9B | 95.10 | 94.74 | 6.43 | 93.24 | 98.53 | 88.89 | 97.84 | 100.00 | 96.92 |
| Open-source MLLMs |  |  |  |  |  |  |  |  |  |  |
| Gemma-3n-E4B-it | 4B | 83.49 | 79.59 | 23.82 | 52.65 | 77.06 | 64.71 | 74.65 | 72.86 | 65.68 |
| Qwen3-VL-8B-Instruct | 8B | 97.68 | 97.40 | 2.93 | 98.47 | 96.99 | 97.65 | 93.18 | 99.24 | 96.88 |
| Llama-4-Maverick | 17B | 98.00 | 97.62 | 3.70 | 97.77 | 97.99 | 97.65 | 94.26 | 98.48 | 96.48 |
| Qwen3.5-397B-A17B | 397B | 98.12 | 98.00 | 2.59 | 87.72 | 87.99 | 86.14 | 91.22 | 94.40 | 89.70 |
| Proprietary MLLMs |  |  |  |  |  |  |  |  |  |  |
| GPT-4o | – | 90.40 | 88.35 | 16.01 | 59.56 | 84.59 | 81.92 | 81.78 | 70.84 | 71.68 |
| GPT-5 | – | 91.81 | 89.56 | 15.79 | 66.83 | 94.48 | 92.35 | 89.19 | 91.77 | 81.65 |
| Claude-Sonnet-4.6 | – | 98.84 | 98.73 | 1.69 | 98.59 | 97.99 | 97.06 | 94.02 | 98.94 | 97.23 |
| Gemini-2.5-Pro | – | 98.81 | 98.37 | 2.46 | 97.24 | 97.82 | 97.65 | 94.18 | 98.94 | 96.74 |

R1 = ROUGE-1 and RL = ROUGE-L (higher is better); E↓ = edit distance (lower is better); these three are the general surface metrics, in %. FFA = Fact-level Financial Accuracy, with the N-, T-, M-, R-, and FC- prefixes denoting numeric, temporal, monetary-unit, reporting-entity, and financial-concept facts, respectively. – = model size not publicly disclosed.
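As a rough illustration of the two metric families, here is a sketch assuming the `rouge-score` and `python-Levenshtein` packages; per-fact verdicts would come from the judge protocol above, and the exact aggregation used in the paper may differ.

```python
from collections import defaultdict

import Levenshtein                    # pip install python-Levenshtein
from rouge_score import rouge_scorer  # pip install rouge-score

def general_metrics(gold_text: str, ocr_text: str) -> dict:
    """Surface metrics: ROUGE-1/ROUGE-L F1 and normalized edit distance (%)."""
    scores = rouge_scorer.RougeScorer(["rouge1", "rougeL"]).score(gold_text, ocr_text)
    edit = Levenshtein.distance(gold_text, ocr_text) / max(len(gold_text), 1)
    return {"R1": 100 * scores["rouge1"].fmeasure,
            "RL": 100 * scores["rougeL"].fmeasure,
            "E": 100 * edit}

def fact_level_accuracy(verdicts: list[tuple[str, bool]]) -> dict:
    """Aggregate per-fact verdicts (fact_type, is_preserved) into FFA per
    fact type plus an overall micro-average."""
    hits, totals = defaultdict(int), defaultdict(int)
    for fact_type, ok in verdicts:
        totals[fact_type] += 1
        hits[fact_type] += ok
    ffa = {t: 100 * hits[t] / totals[t] for t in totals}
    ffa["overall"] = 100 * sum(hits.values()) / sum(totals.values())
    return ffa
```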

Visualization

Model Failure Analysis

Figure: Critical error rates by document modality and complexity level (heatmaps). Critical errors are not uniformly distributed; they concentrate in visually complex, mixed-layout documents containing dense tables, multi-column structures, and interleaved text and figures. Mixed pages are consistently the hardest setting: weaker general-purpose MLLMs exhibit very high critical error rates in this category, with Gemma-3n-E4B-it reaching 100.0% across all complexity levels and the GPT models remaining at 80.0%–100.0%. By contrast, OCR-specialized models are more stable, though not error-free: on text-only pages, DeepSeekOCR rises from 0.0% at low complexity to 23.1% at high complexity, and MinerU2.5 rises from 0.0% to 38.5%. Table-only pages are generally easier when layouts are simple. Overall, the main challenge is OCR under heterogeneous structure rather than OCR alone.
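Such heatmaps can be rebuilt from per-page judge verdicts with a simple pivot; the column names below are assumptions about the per-page metadata, used for illustration.

```python
import pandas as pd

# Assumed per-page results: one row per (model, page) with a judge outcome.
df = pd.DataFrame({
    "model":      ["MinerU2.5", "MinerU2.5", "GPT-4o", "GPT-4o"],
    "modality":   ["text", "mixed", "mixed", "table"],
    "complexity": ["low", "high", "high", "low"],
    "has_critical_error": [False, True, True, False],
})

# Critical error rate (%) per modality x complexity cell, per model.
rates = (df.groupby(["model", "modality", "complexity"])["has_critical_error"]
           .mean() * 100).unstack("complexity")
print(rates)
```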

Representative Failure Cases

Figure: MinerU2.5 failure case. Left (✓): gold annotation. Right (✗): model output.

MinerU2.5: The model drops monetary-unit signs and introduces noise around mathematical expressions. Dollar signs and other currency symbols preceding financial values are omitted, directly degrading the Monetary Unit FFA score, and mathematical expressions are surrounded by spurious characters introduced during OCR post-processing.
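This failure mode is exactly what deterministic monetary-unit rules can catch mechanically; a minimal check might look like the following (illustrative regex, not the benchmark's rule set).

```python
import re

CURRENCY = r"[$€£¥]"

def dropped_currency_symbols(gold: str, extracted: str) -> int:
    """Count currency symbols present in the gold text but missing from the
    OCR output; each one is a potential monetary-unit critical error."""
    n_gold = len(re.findall(CURRENCY, gold))
    n_out = len(re.findall(CURRENCY, extracted))
    return max(n_gold - n_out, 0)

assert dropped_currency_symbols("Revenue of $1,234 million",
                                "Revenue of 1,234 million") == 1
```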

Citation

@misc{he2026fincriticaledvisualbenchmarkfinancial,
      title={FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR}, 
      author={Yueru He and Xueqing Peng and Yupeng Cao and Yan Wang and Lingfei Qian and Haohang Li and Yi Han and Shuyao Wang and Ruoyu Xiang and Fan Zhang and Zhuohan Xie and Mingquan Lin and Prayag Tiwari and Jimin Huang and Guojun Xiong and Sophia Ananiadou},
      year={2026},
      eprint={2511.14998},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.14998}, 
}

License

The dataset and source code are released under the Apache License 2.0, permitting free use, modification, and distribution in academic, research, and commercial settings.