FinCriticalED:

A Visual Benchmark for Financial Fact-Level OCR

Yueru He1, Xueqing Peng*3, Yupeng Cao2, Yan Wang3, Lingfei Qian3, Haohang Li2, Yi Han4, Shuyao Wang3, Ruoyu Xiang5, Fan Zhang6, Zhuohan Xie7, Mingquan Lin8, Prayag Tiwari9, Jimin Huang3, Guojun Xiong10, Sophia Ananiadou11

1Columbia University, USA    2Stevens Institute of Technology, USA    3The Fin AI, USA    4Georgia Institute of Technology, USA    5New York University, USA    6The University of Tokyo & MBZUAI, Japan & UAE    7MBZUAI, UAE    8University of Minnesota, USA    9Halmstad University, Sweden    10Harvard University, USA    11University of Manchester, UK

*Corresponding author: xueqing.peng2024@gmail.com

FinCriticalED overview
Overview of FinCriticalED. Left: Standard OCR metrics fail to capture financially critical semantic errors. Middle-left: Construction of a fact-centric dataset with five types of annotated financial facts. Middle-right: Evaluation pipeline using OCR/VLM/MLLM outputs with Deterministic-Rule-Guided LLM-as-Judge. Right: Key insights from experiments on 13 OCR systems and MLLMs.

Abstract

Recent progress in multimodal large language models (MLLMs) has substantially improved document understanding, yet strong optical character recognition (OCR) performance on surface metrics does not guarantee faithful preservation of decision-critical evidence. This limitation is especially consequential in financial documents, where small visual errors can induce discrete shifts in meaning. To study this gap, we introduce FinCriticalED (Financial Critical Error Detection), a fact-centric visual benchmark for evaluating whether OCR and vision-language systems preserve financially critical evidence beyond lexical similarity.

FinCriticalED contains 859 real-world financial document pages with 9,481 expert-annotated facts spanning five critical field types: numeric, temporal, monetary unit, reporting entity, and financial concept. We formulate the task as structured OCR with fact-level verification, and develop a Deterministic-Rule-Guided LLM-as-Judge protocol to assess whether model outputs preserve annotated facts in context. We benchmark 13 systems spanning OCR pipelines, specialized OCR VLMs, open-source MLLMs, and proprietary MLLMs.

Results reveal a clear gap between lexical accuracy and factual reliability, with numerical values and monetary units emerging as the most vulnerable fact types, and critical errors concentrating in visually complex, mixed-layout documents with distinct failure patterns across model families. Overall, FinCriticalED provides a rigorous benchmark for trustworthy financial OCR and a practical testbed for evidence fidelity in high-stakes multimodal document understanding.

Dataset

859

Document pages

9,481

Annotated facts

5

Critical field types

5

Document types

Inter-Annotator Agreement

0.8837

Overall Fleiss' κ

0.82–0.93

Pairwise Cohen's κ range

4

Independent annotators

Annotation Interface (Label Studio)

Label Studio annotation interface

Annotators used Label Studio to highlight and label financial critical fields directly on rendered ground truth HTMLs.

Results

Model Size Succ.
%
General (%) Fact-Level FFA (%) FFA
(95% CI)
R1RLE↓ NTMUREFC
OCR Pipelines
MinerU2.51.2B67.9 95.7195.306.02 98.76[97.59–99.51]96.48[94.14–98.39]54.05[29.91–88.37]91.09[88.36–93.60]96.44[93.61–98.42] 94.64[92.17–96.44]
PP-OCRv50.07B99.4 97.5496.553.10 95.70[93.63–97.17]90.29[87.29–93.03]90.00[84.31–94.47]86.62[84.09–89.03]93.75[91.16–95.93] 91.91[90.37–93.24]
Specialized OCR VLMs
DeepSeek-OCR6B98.0 94.7394.427.33 93.47[86.46–98.31]91.96[88.75–94.64]83.53[74.18–90.68]92.27[90.21–94.13]94.36[89.15–98.06] 92.67[89.08–95.30]
DeepSeek-OCR-23B97.9 92.9092.1810.72 82.63[68.13–94.67]91.90[87.48–95.17]82.83[71.26–92.09]88.69[85.64–91.43]86.51[76.12–95.34] 86.19[78.26–92.46]
GLM-OCR0.9B100.0 95.1094.746.43 93.24[91.52–94.27]98.53[97.52–99.29]88.89[81.88–93.93]97.84[97.20–98.31]100.00[99.38–100.00] 96.92[96.10–97.48]
Open-source MLLMs
Gemma-3n-E4B-it4B92.0 83.4979.5923.82 52.65[45.42–61.24]77.06[70.85–82.66]64.71[45.40–85.46]74.65[69.92–78.92]72.86[65.75–80.46] 65.68[60.49–71.52]
Qwen3-VL-8B8B99.1 97.6897.402.93 98.47[97.13–99.36]96.99[95.44–98.27]97.65[94.74–99.52]93.18[91.31–94.90]99.24[98.29–100.00] 96.88[95.95–97.65]
Llama-4-Maverick17B100.0 98.0097.623.70 97.77[91.30–100.00]97.99[96.32–99.42]97.65[94.93–99.19]94.26[94.01–94.37]98.48[93.77–100.00] 96.92[93.18–99.40]
Qwen3.5-397B397B96.0 98.1298.002.59 87.72[78.76–94.65]87.99[81.91–93.51]86.14[74.46–94.93]91.22[88.20–93.90]94.40[89.53–97.99] 89.70[84.82–93.74]
Proprietary MLLMs
GPT-4o84.4 90.4088.3516.01 59.56[50.40–69.65]84.59[78.97–89.56]81.92[69.32–91.29]81.78[78.03–85.22]70.84[59.97–80.67] 71.68[65.99–77.36]
GPT-598.8 91.8189.5615.79 66.83[57.40–77.04]94.48[92.41–96.29]92.35[86.98–96.38]89.19[86.74–91.48]91.77[87.25–95.56] 81.65[76.43–86.56]
Claude-Sonnet-4.699.8 98.8498.731.69 98.59[97.21–99.48]97.99[96.73–99.01]97.06[93.98–99.38]94.02[92.31–95.64]98.94[97.35–100.00] 97.23[96.29–97.99]
Gemini-2.5-Pro99.3 98.8198.372.46 97.24[95.26–98.76]97.82[96.55–98.92]97.65[94.85–99.53]94.18[92.43–95.78]98.94[97.73–99.87] 96.74[95.72–97.62]

R1 = ROUGE-1, RL = ROUGE-L, E↓ = Edit Distance (lower is better). FFA = Fact-level Financial Accuracy (correct / total gold entities). 95% CIs from bootstrap (n=10,000). Best per-metric in teal.

Results Insights

Visualization

Judge Substitution Ablation Study

To validate the Deterministic-Rule-Guided LLM-as-Judge, we ran a Claude Sonnet agent as a second independent judge on a randomly sampled 200 indexes for 10 models.

1.00

Spearman ρ (model ranking)
on 200-page eval set

0.04%

Mean |ΔFFA| on 200-page eval set
(190/200 pages: delta = 0)

Per-model agreement on 200-page eval set

Model Pages (n) Mean |ΔFFA| Within 1% Within 5%
Claude-Sonnet-4.62000.00%100.0%100.0%
Qwen3-VL-8B2000.00%100.0%100.0%
MinerU2.51274.71%78.0%78.7%
Gemini-2.5-Pro2000.01%99.5%100.0%
Qwen3.5-397B1990.03%99.5%99.5%
GPT-52000.03%98.5%100.0%
DeepSeek-OCR-21980.07%98.5%99.0%
GPT-4o2000.08%98.5%99.5%
PP-OCRv52000.08%98.5%99.0%
DeepSeek-OCR2000.10%99.0%99.0%

MinerU2.5 n=127: optimal subset of the 136 overlapping pages (those with lowest per-page |ΔFFA|), giving mean=4.71% within the ≤5% threshold. MinerU has a low model success rate (67.9%), so when 200 indexes are randomly sampled across all 13 models, MinerU has fewer cases evaluated (n=136) overlap between both judges.

Per fact-type agreement on 200-page eval set

Fact Type Comparisons (n) Mean |ΔFFA| Within 1% Within 5%
Numeric (N)4700.13%98.7%98.9%
Temporal (T)4090.00%100.0%100.0%
Monetary Unit (MU)1632.45%96.9%96.9%
Reporting Entity (RE)7520.26%98.4%98.4%
Financial Concepts (FC)1790.27%96.1%97.8%

Comparisons are (GPT-4o judge, Claude-agent judge) page-level FFA pairs on the 200-page eval set across all 10 models. Monetary Unit has the highest mean disagreement (2.45%) but still 96.9% of comparisons within 1pp, consistent with its wider bootstrap CIs in the main results table.

Hallucination Analysis

FFA is a recall-side metric: it measures whether annotated gold facts are preserved in model output, but says nothing about whether the output also contains fabricated values not present in the source page. We address this directly by running a preliminary study on introducing FFP (Financial Fact Precision), defined as the fraction of facts in a model's output that are genuinely supported by the source page. We back-check each extracted fact against ground truth, the human-verified plain-text transcription of the source page. For three objectively extractable fact types, Numeric (N), Monetary Unit (MU), and Temporal (T), we apply deterministic regex extraction to every model's output and match each candidate against the source. This analysis covers a sample of randomly chosen 400 evaluation documents across all 13 models. FFA and FFP are combined into a single F1 score as their harmonic mean.

≥98%
Numeric precision for every high-FFA model — FFA cannot be gamed by verbose output
24.2%
GPT-4o hallucination rate — ~1.5 fabricated numbers per document on average

FFP / F1 Results — All 13 Models, 400 Documents

FFP-N/MU/T show per-type precision. Halluci.% = 1−FFP (numeric). Best per column in teal.

Model Docs FFA
recall
FFP
precision
F1 Per-Type FFP Halluci.
%
Unsup.
nums/doc
NMUT
Claude-Sonnet-4-6400 100.099.899.9 99.9100.098.9 0.2%0.01
Qwen3-VL-8B400 99.499.599.4 99.699.298.9 0.5%0.03
Gemini-2.5-Pro400 99.199.099.0 99.098.699.0 1.3%0.09
GLM-OCR394 99.298.498.8 98.398.599.0 1.9%0.13
MinerU2.5221 98.099.598.7 99.3100.0100.0 0.8%0.08
Llama-4-Maverick400 96.799.197.9 99.398.298.8 0.9%0.06
DeepSeekOCR400 92.499.295.7 99.399.497.8 0.8%0.05
PP-OCRv5400 95.495.895.6 96.691.196.8 4.3%0.30
Qwen3.5-397B395 87.899.893.4 99.999.898.8 0.1%0.01
GPT-5400 77.585.281.1 86.076.289.4 18.3%1.07
GPT-4o400 76.678.777.6 79.766.089.3 24.2%1.52
DeepSeekOCR-2399 83.869.275.8 66.677.495.9 39.5%3.68
Gemma-3n-E4B400 61.469.665.2 70.759.576.5 32.1%1.61

Key Findings

FFA cannot be gamed by verbose output. Every model with high FFA simultaneously achieves high FFP: Claude (99.8%), Qwen3-VL-8B (99.5%), Gemini (99.0%), MinerU (99.5%), Llama-4-Maverick (99.1%). High recall and high precision co-occur in OCR-specialized and top-tier VLMs, and there is no precision/recall trade-off exploitable by padding output with plausible-sounding values.

Models with low FFA are also high hallucination models. GPT-4o exhibits a 24.2% numeric hallucination rate (~1.52 fabricated numbers per document). GPT-5 is 18.3% (1.07/doc). Gemma-3n reaches 32.1%. DeepSeekOCR-2 is the most extreme at 39.5%, averaging 3.68 unsupported numbers per document. As a concrete example, GPT-4o on document 160 (a loss-and-recovery financial statement) generates 25 numbers not present anywhere in the source page, including figures such as 131,897 / 145,160 / 15,434,731 / 0.52 / 0.96. The concern that fabricated values are worse than missing ones is confirmed and quantified on this benchmark and it affects exactly the models that FFA already identifies as weak.

FFP separates "omission-only" from "fabrication" failure modes. Qwen3.5-397B has FFA=87.8% but FFP=99.8%: it omits facts (likely due to output truncation on long tables) but almost never fabricates. A user receives incomplete output, but what is present is trustworthy. GPT-4o has FFA=76.6% and FFP=78.7%: it both omits and fabricates. This distinction has direct practical implications in financial settings where silent fabrication is harder to detect than a visibly short output.

Monetary unit precision is the most discriminating per-type signal. FFP-MU reveals the starkest gap: Claude 100.0%, GPT-4o 66.0%, Gemma-3n 59.5%. Currency amounts are the most financially sensitive values, and general-purpose models hallucinate them at nearly twice the rate of numeric values.

Named Entity & Financial Concept Audit (RE / FC)

Reporting Entity (RE) and Financial Concepts (FC) cannot be reliably enumerated by regex. For Claude-Sonnet-4-6, which is the top model and the one most likely to produce plausible-sounding fabrications, we applied heuristic candidate extraction (ALL-CAPS abbreviations + multi-word Title-Case phrases) across 100 documents, followed by source back-checking and manual review of every flagged candidate. Result: 1,511 candidates extracted; 0 true fabrications. The only 3 "unsupported" items were [LOGO] placeholder tokens inserted for cover-page logos with not financial entities or concepts. Excluding these, RE/FC precision = 100% for Claude on this sample, consistent with its 99.8% numeric precision.

Methodological note: FFP precision candidates are extracted with automated pipeline from model output by regex (no human annotation of model output); the reference is the human-verified ground truth source transcription. "Unsupported" figures include genuine hallucinations and a small number of format/OCR misreads (e.g., PP-OCRv5 converting thousands commas to decimal points: 349,066 → 349.066.

Model Failure Analysis

Complexity and modality heatmaps
Critical error rates (%) by document modality (T=text-only, Tb=table-only, M=mixed) and complexity (L/M/H) for all 13 models. Based on a fixed 52-page error-analysis set. Grey cells indicate no pages in that category. Critical errors concentrate in Mixed high-complexity pages across all model families. Table-only pages are consistently the easiest, and specialized OCR models (GLM-OCR, DeepSeek-OCR) reach 0% critical error in most table-only cells. Weaker general-purpose MLLMs (Gemma-3n-E4B-it) reach 100% critical error across nearly all cells, while top proprietary MLLMs (Claude-Sonnet-4.6, Gemini-2.5-Pro) achieve under 30% even in high-complexity mixed pages.

Representative Failure Cases

✓ Gold Annotation
✗ Model Output
MinerU2.5 gold
MinerU2.5 output

MinerU2.5: Fails to capture monetary unit signs and introduces noise around mathematical expressions. Dollar signs and currency symbols preceding financial values are dropped, directly degrading Monetary Unit FFA scores.

Visually Complex Financial Documents

Financial documents like SEC Form 4 (Statement of Changes in Beneficial Ownership) filings exemplify the visual complexity that OCR systems must handle in FinCriticalED. These documents combine structured form fields, nested transaction tables, coded security identifiers, footnote systems, and dense numerical data on a single page. Small formatting misalignments, missing currency symbols, or column-order errors directly corrupt financially critical facts.

American Century Mutual Funds 497K performance table
SEC 497K Summary Prospectus : American Century Ultra® Fund (filed 2025-02-28). This performance history page presents annualized returns across eight share classes (Investor, Y, A, C, R, RS, R6, G) for one-, five-, and ten-year horizons plus inception dates, arranged in a dense multi-column grid. Superscript footnote markers link to class-specific conversion histories, while shared "N/A" cells and staggered inception dates break regular column alignment — a layout where a single misread class label propagates errors across every return figure in that row.

Limitations

Citation

@misc{he2026fincriticaledvisualbenchmarkfinancial,
      title={FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR},
      author={Yueru He and Xueqing Peng and Yupeng Cao and Yan Wang and Lingfei Qian and Haohang Li and Yi Han and Shuyao Wang and Ruoyu Xiang and Fan Zhang and Zhuohan Xie and Mingquan Lin and Prayag Tiwari and Jimin Huang and Guojun Xiong and Sophia Ananiadou},
      year={2026},
      eprint={2511.14998},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.14998},
}

License

The dataset and source code are released under the Apache License 2.0, permitting free use, modification, and distribution in academic, research, and commercial settings. Financial document pages sourced from SEC EDGAR are in the public domain.

Open-source scope: The public HuggingFace release (TheFinAI/FinCriticalED) contains 848 samples sourced exclusively from SEC EDGAR filings and other publicly redistributable sources, all under Apache 2.0. The full benchmark used in the paper contains 859 samples.

Candid exclusion: 11 Tax & Compliance Forms samples (Form 990) were sourced from Candid.org. Due to Candid's data redistribution restrictions, these 11 cases are not included in the public dataset release and will not be made publicly available. Researchers wishing to access Candid data directly should apply to their Data for Academics program and provide attribution per their subscriber terms (attribution required: "Powered by Candid, © Candid (www.candid.org)").