Verifiable Geometry Problem Solving: Solver-Driven Autoformalization and Theorem Proposing

Can Li Ting Zhang Junbo Zhao Hua Huang
Beijing Normal University

Abstract

Geometry Problem Solving have increasingly adopt the neuro-symbolic paradigm, combining neural intuition with symbolic rigor. However, current frameworks suffer from severe bottlenecks in two core stages: autoformalization, which treats multimodal translation as a static task decoupled from downstream solver compatibility, and theorem prediction, where solvers frequently hit a deductive impasse due to fixed rule libraries. To address these, we propose SD-GPS, a solver-driven framework that treats the symbolic solver as an execution oracle throughout both formalization and deduction. First, Solver-Driven Autoformalization unifies supervised formal-language adaptation and solvability-guided reinforcement learning into a single module built on QwenVL3-2B, making executability the central training signal. Second, Verified Theorem Proposing introduces an impasse-aware agent that proposes local auxiliary lemmas from current proof states, ensuring soundness by filtering all proposals through symbolic verification. Empirical evaluations on Geometry3K and PGPS9K demonstrate that SD-GPS consistently outperforms existing MLLM, neural, and neuro-symbolic methods across standard completion, multiple-choice, and cross-modal reference regimes, proving that closing the loop between multimodal perception and symbolic execution significantly improves geometric reasoning, offering profound insights into how neural agents can be grounded by formal systems to achieve verifiable problem-solving capabilities.

1 Introduction

Geometry Problem Solving (GPS) aims to derive mathematical solutions from a textual problem description and its corresponding diagram (Chen et al., 2022; Gelernter et al., 1960; Sachan and Xing, 2017; Wu, 1986; Wu et al., 2024; Peng et al., 2023). To tackle the inherent complexity of geometry, many recent systems adopt the neuro-symbolic paradigm (Trinh et al., 2024; Chervonyi et al., 2025). This dual-process approach utilizes neural networks as a "perceptual front-end" and symbolic engines as a "logical back-end" (Wu et al., 2022a). Within this framework, the reasoning pipeline is driven by two core modules: (i) Autoformalization, which translates multimodal raw inputs into executable formal representations, and (ii) Theorem Prediction, which selects or proposes theorem instances that can be used by the solver to derive solutions. This modular architecture allows systems to leverage the intuitive recognition of deep learning alongside the deterministic rigor of formal logic.

In the autoformalization stage, the objective is to transform multimodal inputs into solver-ready predicates. However, existing GPS frameworks typically adopt a “decoupled-and-rectified” paradigm (Lu et al., 2021; Zhao et al., 2025; Ping et al., 2026), in which isolated neural modules independently parse diagrams and formalize text, followed by a post-hoc rectification step to resolve referential ambiguities (Zhao et al., 2025; Ping et al., 2026). Beyond this structural fragility, a more fundamental limitation lies in objective misalignment. Current approaches treat formalization as a static, one-way semantic translation task, optimizing against expensive human-annotated logical forms that emphasize linguistic fidelity rather than computational solvability. As a result, the formalization process is not aligned with the ultimate goal of solver-based problem solving, and high semantic accuracy does not necessarily translate into effective or executable representations for downstream reasoning.

Even when provided with accurate formalization, the theorem prediction stage faces a second critical bottleneck: a fixed theorem library and finite search budget can leave the solver at a deductive impasse. State-of-the-art GPS frameworks (Ping et al., 2026; Zhang et al., 2024b) usually operate with a pre-defined library of geometric rules, so they may fail when a problem requires a rarely selected theorem instance, a small auxiliary construction, or an intermediate relation such as cyclicity, similarity, or proportionality. We therefore treat theorem proposing as a bounded, verifier-controlled proposal problem rather than as free-form axiom generation: a neural agent may suggest local auxiliary lemmas from the current proof state, but only the symbolic back-end can accept, reject, and use them.

Refer to caption — Figure 1: Performance comparison among existing methods.

To resolve these misalignments, we propose SD-GPS, a solver-driven framework that treats the symbolic solver as an execution oracle throughout both formalization and deduction. First, Solver-Driven Autoformalization replaces the decoupled pipeline with a unified multimodal formalizer built on QwenVL3-2B. It reads the raw diagram and problem text jointly, learns the target formal language through supervised adaptation, and is further optimized with solvability-guided feedback from the solver. This makes executability, rather than surface-level textual similarity alone, the central training signal. Second, Solver-Driven Theorem Proposing addresses solver-side impasses. Instead of allowing a neural model to add unchecked axioms, the agent proposes bounded auxiliary lemmas or theorem instantiations from the current proof state, and every accepted proposal must pass symbolic verification before it can drive the derivation.

Empirical evaluations on Geometry3K Lu et al. (2021) and PGPS9K Zhang et al. (2023) show consistent improvements over MLLM, neural, and neuro-symbolic baselines. As summarized in Figure 1, SD-GPS improves completion accuracy on both benchmarks over the strongest listed prior systems. The results suggest that closing the loop between multimodal perception and symbolic execution improves both solver-ready formalization and final geometric reasoning.

In summary, our contributions are three-fold:

•

Solver-Driven Autoformalization: We unify supervised formal-language adaptation and solvability-guided reinforcement learning into a single autoformalization module that produces solver-executable predicates from raw diagram-text inputs.
•

Verified Theorem Proposing: We introduce an impasse-aware theorem-proposing agent that proposes auxiliary lemmas from solver states, while preserving soundness by accepting only proposals verified by the symbolic back-end.
•

Systematic Evaluation: We evaluate under standard completion and choice settings, cross-modal point-reference regimes, and OCR-induced failure modes, showing robust gains on Geometry3K and PGPS9K.

2 Related Work

Geometry problem solving. The field of GPS has evolved from classical rule-based engines to contemporary neural-symbolic hybrids. Early symbolic methods, such as Wu’s Method (Wu, 1978) and Groebner Basis (Buchberger, 1976), established a foundation of mathematical rigor but were fundamentally limited by their inability to interpret informal multimodal inputs. To bridge this gap, modern frameworks (Wu et al., 2024; Peng et al., 2023) like Inter-GPS (Lu et al., 2021) and UniGeo (Cheng et al., 2025) integrate deep learning parsers with symbolic solvers. However, these systems predominantly rely on a decoupled paradigm where text and diagrams are processed by isolated neural modules. Recent advancements, notably Pi-GPS (Zhao et al., 2025) and AutoGPS (Ping et al., 2026), attempt to mitigate this isolation by introducing specialized post-hoc rectification stages or diagrammatic heuristics to resolve referential ambiguities. Despite these efforts, such systems remain tethered to the quality of initial independent parsing results. This structural separation often leads to cascading information loss, as the formalization process lacks a unified representation to resolve inter-modal dependencies, particularly when faced with complex spatial relationships that are under-specified in text or diagram.

Multimodal autoformalization. Autoformalization bridges the gap between informal human intuition and machine-verifiable logic (Wu et al., 2022b). In geometry, this task is particularly formidable as linguistic descriptions are often logically under-specified without visual grounding. The significance of this field is underscored by MATP-BENCH (He et al., 2025), which highlights that formally verified multimodal mathematics remains a grand challenge for frontier MLLMs. Recent frameworks, such as AutoGPS (Ping et al., 2026) and Pi-GPS (Zhao et al., 2025), attempt to address this by translating diagrams into predicates. However, they predominantly rely on a decoupled paradigm that utilizes specialized, non-transformer parsers like PGDP (Hao et al., 2022), necessitating post-hoc heuristics to resolve inter-modal inconsistencies. Similarly, DFE-GPS (Zhang et al., 2024c) treats diagram formalization merely as an auxiliary signal rather than a primary objective. While recent MMFORMALIZER (Xiong et al., 2026) extends autoformalization to physics via recursive grounding and rule-based searches in PhysLean, it focuses on high-level axiomatic deduction. In contrast, our framework achieves joint diagram-text alignment within a unified LMM representation. By enabling bidirectional disambiguation in a single inference pass, we resolve low-level topological ambiguities directly from raw pixels, offering a more integrated and scalable pathway for verifiable multimodal reasoning.

Verifiable reinforcement learning. Reinforcement learning has also been explored in geometry reasoning, for example GeoDRL formulates deductive reasoning as sequential decision making over geometry logic graphs (Peng et al., 2023). More recently, LLM reasoning has shifted toward verifiable rewards (DeepSeek-AI, 2025), where correctness can be computed automatically from math or code outputs instead of relying on human preference labels. DeepSeekMath popularized this paradigm in mathematical reasoning with GRPO (Shao et al., 2024), while GSPO later improved training stability and efficiency through sequence-level policy optimization (Zheng et al., 2025). In contrast, our work anchors optimization on solver-readiness of the formal representation. We adapt GSPO to multimodal autoformalization with rewards derived from parse validity, executable solver states, and answer-level solvability. When gold formal annotations are available, we use them for supervised adaptation and analysis; the reinforcement-learning stage itself can be driven by parser and solver execution feedback, making the objective closer to the downstream symbolic task.

3 Method

3.1 Framework Overview

Given a geometry problem consisting of a diagram $I$ , a textual question $T$ , and a target answer $a^{\ast}$ , SD-GPS aims to generate a formal representation $\hat{\Lambda}$ that can be consumed by a symbolic solver $\mathcal{S}$ . The system is organized around two solver-coupled loops. The first loop learns a multimodal formalizer $F_{\theta}$ :

\hat{\Lambda}=F_{\theta}(I,T),

(1)

where $\hat{\Lambda}$ contains formal predicates grounded in both the diagram and the text. The second loop augments symbolic deduction with a theorem-proposing agent. When $\mathcal{S}(\hat{\Lambda})$ cannot derive the target answer, the agent observes the intermediate proof state and proposes a theorem hypothesis $\hat{\tau}$ , after which the solver verifies whether $\hat{\tau}$ is applicable and useful for continuing the proof. This design keeps all accepted formalizations and theorem proposals grounded in symbolic execution.

Figure 2 illustrates the overall workflow of SD-GPS, including multimodal formalization, solver feedback, repair, and theorem proposal.

3.2 Formalization Data: Curated and Synthetic Sources

To expose the formalizer to realistic textbook phrasing, diverse diagram styles, and annotation noise, we build a mixed formalization corpus from curated real-world examples and controllable synthetic instances. The real-world portion is constructed from Geometry3K-Train, the training split of PGPS9K, and the PGDP training set. Since these datasets differ in annotation conventions and were not originally designed as a unified strict formal-language benchmark, we normalize symbolic expressions, entity names, and diagram-text grounding formats. Geometry3K annotations are proofread and standardized to reduce inconsistent predicate usage and ambiguous entity references. For PGPS9K, diagram-side formal language is converted into the Inter-GPS formalism, disambiguated during preprocessing, and then filtered by parser and solver execution. Samples whose converted predicates fail parsing, entity grounding, or solver verification are removed.

The synthetic portion is generated from geometry templates that specify a diagram construction program, a predicate-level formalization, and a target query. Template parameters control point layout, label visibility, auxiliary marks, algebraic values, and ambiguity type. We use the solver to reject inconsistent constructions and keep only examples whose predicates are executable and whose target answer is uniquely determined. This gives the formalizer additional supervision for cases that are underrepresented in real data, such as omitted labels, diagram-supplied points, dense angle marks, and visually similar point names.

The final training data are tuples $(I,T,\Lambda^{\ast},a^{\ast},m)$ , where $I$ denotes the diagram, $T$ denotes the problem text, $\Lambda^{\ast}$ denotes the target formal representation, $a^{\ast}$ denotes the gold answer, and $m$ stores metadata about entity grounding, ambiguity type, annotation source, and solver status. This metadata supports curriculum sampling. Early training emphasizes clean, solver-verified examples with direct text-diagram alignment, while later training gradually increases the proportion of diagram-dependent, ambiguity-heavy, and cross-modal-mismatch examples. We keep training, validation, and test problems separated at the problem and diagram levels to reduce leakage between curated or template-derived instances and held-out evaluation examples.

3.3 Solver-Driven Autoformalization

Solver-Driven Autoformalization is the front-end loop that converts raw multimodal inputs into a formal representation that the solver can execute. Rather than first extracting OCR tokens, diagram primitives, and textual predicates through separate modules, the formalizer $F_{\theta}$ receives the raw diagram $I$ and problem text $T$ jointly and emits a complete predicate sequence $\hat{\Lambda}$ in one pass. A lightweight parser checks arity, type compatibility, duplicated declarations, illegal symbols, and unresolved references before any candidate is passed to the symbolic solver $\mathcal{S}$ . In our implementation, $F_{\theta}$ is initialized from QwenVL3-2B (Bai et al., 2025); all supervised, RL, repair, and theorem-proposal variants use this same 2B-parameter multimodal backbone unless explicitly stated otherwise. The same parser and solver are then used in both training and inference, so the model is optimized toward the interface actually consumed by the downstream reasoner.

Stage I: supervised formal-language adaptation.

Given the curated and synthetic formalization dataset

\mathcal{D}_{\mathrm{SFT}}=\left\{(I_{i},T_{i},\Lambda_{i}^{\ast},a_{i}^{\ast})\right\}_{i=1}^{N},

(2)

we first adapt QwenVL3-2B, a 2B-parameter pretrained multimodal language model, to the target geometry formal language. The model generates

\hat{\Lambda}_{i}=F_{\theta}(I_{i},T_{i}),

(3)

and is trained with the autoregressive negative log-likelihood of the target predicate sequence $\Lambda_{i}^{\ast}=(y_{i,1},\ldots,y_{i,L_{i}})$ :

\mathcal{L}_{\mathrm{SFT}}(\theta)=-\frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{L_{i}}\log p_{\theta}\left(y_{i,t}\mid I_{i},T_{i},y_{i,<t}\right).

(4)

This stage teaches the predicate inventory, output grammar, and cross-modal grounding patterns, providing a solver-compatible initialization before execution-based optimization.

Stage II: solvability-guided policy optimization.

After supervised adaptation, we further optimize $F_{\theta}$ with Solvability-Guided Reinforcement Learning (SG-RL), instantiated with GSPO-style group sequence policy optimization. For each problem, the model samples a group of $K$ candidate formalizations $\{\hat{\Lambda}_{i,j}\}_{j=1}^{K}$ . Each candidate is parsed and, if valid, executed by $\mathcal{S}$ . We assign a sequence-level reward

R(\hat{\Lambda})=\lambda_{\mathrm{fmt}}R_{\mathrm{fmt}}(\hat{\Lambda})+\lambda_{\mathrm{exec}}R_{\mathrm{exec}}(\hat{\Lambda})+\lambda_{\mathrm{ans}}R_{\mathrm{ans}}(\hat{\Lambda},a^{\ast}),

(5)

where

R_{\mathrm{fmt}}(\hat{\Lambda})=\mathbb{I}[\mathrm{Parse}(\hat{\Lambda})=1],

(6)

R_{\mathrm{exec}}(\hat{\Lambda})=\mathbb{I}[\mathcal{S}\ \text{constructs a non-contradictory proof state from }\hat{\Lambda}],

(7)

and

R_{\mathrm{ans}}(\hat{\Lambda},a^{\ast})=\mathbb{I}[\mathcal{S}(\hat{\Lambda})=a^{\ast}].

(8)

The format term rewards syntactic validity, the execution term rewards usable intermediate solver states, and the answer term rewards candidates that are sufficient for deriving the gold answer. Invalid or unparsable candidates receive no execution or answer reward. Within each sampled group, rewards are normalized into sequence-level advantages and used to update the policy, so formalizations that are merely plausible but unusable by the solver are discouraged. Unlike the supervised stage, SG-RL does not require gold formal-language annotations for every problem; it requires the problem input, the target answer, and solver feedback.

Inference-time use.

At test time, the same parser-solver interface provides structured diagnostics. If a candidate fails parsing or execution, it is sent to the bounded repair loop described in Appendix A.2. If the candidate is executable but the solver cannot close the proof within the search budget, the system invokes the theorem-proposing agent in Section 3.4.

3.4 Solver-Driven Theorem-Proposing Agent

Even a correct and executable formalization may be insufficient when the symbolic solver is restricted to a fixed theorem library or a finite search budget. The Solver-Driven Theorem-Proposing Agent addresses this bottleneck by acting as a proposal module over the current proof state, not as an unchecked source of new axioms. In our implementation, the agent is driven by the same QwenVL3-2B backbone as the formalizer, but its output is restricted to a verifier-readable proposal schema. It is activated only after $\mathcal{S}$ has successfully parsed $\hat{\Lambda}$ and built a consistent proof state but cannot derive the target query $q$ . At iteration $t$ , the solver exposes a state summary $h_{t}$ containing known facts, unresolved subgoals, failed theorem matches, and a compact diagnostic message $d_{t}$ . Conditioned on $(\hat{\Lambda},q,h_{t},d_{t})$ , the agent proposes a bounded set of candidate auxiliary lemmas

\mathcal{T}_{t}=A_{\phi}(\hat{\Lambda},q,h_{t},d_{t}),

(9)

where each candidate is represented as a verifier-readable tuple containing a lemma type, required premises, proposed conclusion, involved entities, and a short rationale.

The solver filters these candidates before any of them can affect the derivation. A proposal is discarded if it contains unknown entities, violates type constraints, contradicts existing facts, cannot be instantiated against the current proof state, cannot be verified from admissible premises, or fails to reduce any unresolved subgoal. For an accepted proposal, the verifier returns a local proof certificate or a deterministic construction trace, and the derived fact is scoped only to the current problem instance. It is removed after the example is solved or declared unresolved. This scoping is important: the agent may suggest a useful lemma instantiation or auxiliary relation, but the symbolic verifier remains responsible for deciding whether the suggestion is admissible. In practice, the agent mostly helps in three cases: selecting a theorem instance that the default search failed to prioritize, exposing a latent relation such as cyclicity or similarity, and proposing a small auxiliary construction that makes an existing theorem applicable.

Algorithm 1 Solver-driven theorem proposal and verification

1:Formal predicates

\hat{\Lambda}

, target query

q

, solver

\mathcal{S}

, theorem agent

A_{\phi}

, maximum proposal rounds

T_{\max}

, proposal budget

K

2:Solved answer or unresolved status

(s_{0},h_{0})\leftarrow\mathcal{S}.\textsc{Initialize}(\hat{\Lambda},q)

4:if

s_{0}

is invalid or contradictory then

5: return unresolved

\triangleright

Handled by formalization repair, not theorem proposal

6:end if

7:if

\mathcal{S}.\textsc{Closed}(s_{0},q)

then

8: return

\mathcal{S}.\textsc{Answer}(s_{0},q)

9:end if

10:for

t=1

T_{\max}

11:

s_{t}\leftarrow s_{t-1}

12:

d_{t}\leftarrow\mathcal{S}.\textsc{Diagnose}(s_{t-1},q)

13:

\mathcal{T}_{t}\leftarrow A_{\phi}(\hat{\Lambda},q,h_{t-1},d_{t};K)

14: for all

\tau\in\mathcal{T}_{t}

15: if not

\textsc{WellTyped}(\tau,\hat{\Lambda})

then

16: discard

\tau

17: else if not

\mathcal{S}.\textsc{Applicable}(\tau,s_{t-1})

then

18: discard

\tau

19: else if not

\mathcal{S}.\textsc{Verify}(\tau,s_{t-1})

then

20: discard

\tau

21: else if not

\mathcal{S}.\textsc{GoalReduced}(\tau,s_{t-1},q)

then

22: discard

\tau

23: else

24:

(s_{t},c_{\tau})\leftarrow\mathcal{S}.\textsc{AddLocalLemma}(s_{t-1},\tau)

25:

s_{t}\leftarrow\mathcal{S}.\textsc{Search}(s_{t},q)

26: if

\mathcal{S}.\textsc{Closed}(s_{t},q)

then

27: return

\mathcal{S}.\textsc{Answer}(s_{t},q)

28: end if

29: end if

30: end for

31:

h_{t}\leftarrow\mathcal{S}.\textsc{Summarize}(s_{t},q)

32:end for

33:return unresolved

Algorithm 1 summarizes the loop. The proposal budget keeps inference bounded, while the verifier checks ensure that neural proposals cannot directly determine the answer. The stored certificate $c_{\tau}$ is used only inside the current proof attempt, so accepted proposals behave as verified local lemmas rather than permanent changes to the theorem library. The agent therefore expands the solver’s search behavior without replacing symbolic proof checking.

4 Experiments

Table 1: Results on Geometry3K and PGPS9K.

MLLMs
Method	Geometry3K		PGPS9K
	Completion	Choice	Completion	Choice
G-LLaVA-13B (Gao et al., 2025)	0.3	29.0	0.0	27.0
Vision-R1-7B (Huang et al., 2025)	43.8	57.1	36.8	49.6
Qwen2.5-VL-32B (Qwen et al., 2025)	41.9	67.6	43.5	56.1
InternVL2.5-78B (Chen et al., 2024)	36.1	60.9	28.1	51.3
InternVL3-78B (Zhu et al., 2025)	57.4	74.5	48.9	61.1
GPT-4o (OpenAI, 2024)	34.8	58.6	38.2	56.8
GLM-4.6V (Team et al., 2026b)	53.7	65.3	47.7	60.8
Kimi-K2.5 (Team et al., 2026a)	82.9	87.2	73.0	79.8
QwenVL3-2B (Bai et al., 2025)	36.6	52.5	27.2	45.4
Neural Solvers
NGS (Chen et al., 2022)	35.3	58.8	34.1	46.1
Geoformer (Khomiakov et al., 2024)	36.8	59.3	35.6	47.3
SCA-GPS (Ning et al., 2023)	76.7	–	–	–
GOLD (Zhang and Moshfeghi, 2024)	–	62.7	–	60.6
PGPSNet-v2-S (Zhang et al., 2024a)	65.2	76.4	60.3	69.2
LANS (Diagram GT) (Li et al., 2024)	72.1	82.3	66.7	74.0
PGPSNet (Zhang et al., 2023)	65.0	77.9	62.7	70.4
Symbolic and Neuro-symbolic Solvers
Inter-GPS (Lu et al., 2021)	43.4	57.5	–	–
E-GPS (Wu et al., 2024)	–	67.9	–	–
GeoDRL (Peng et al., 2023)	57.9	68.4	55.6	66.7
Auto-GPS (Ping et al., 2026)	75.4	81.6	75.4	81.5
Auto-GPS (OCR) (Ping et al., 2026)	65.6	75.2	62.9	71.9
Pi-GPS (Zhao et al., 2025)	70.6	77.8	61.4	69.8
Pi-GPS (OCR) (Zhao et al., 2025)	66.1	73.1	56.3	64.2
SD-GPS	86.4	90.4	79.8	84.5

Datasets.

We evaluate on Geometry3K (Lu et al., 2021) and PGPS9K (Zhang et al., 2023), the two benchmarks used to test whether solver-driven formalization transfers from raw multimodal input to final problem solving. Geometry3K contains 3,002 geometry problems, split into 2,101 training, 300 validation, and 601 test problems. Each problem includes a diagram, problem text, and formal-language annotations. PGPS9K extends Geometry3K to 9,022 problems paired with 4,000 unique diagrams and covers common plane-geometry types from middle-school and high-school textbooks.

Metrics.

Following prior GPS evaluations (Lu et al., 2021; Zhang et al., 2023; Ping et al., 2026; Zhao et al., 2025), we report final problem-solving accuracy under two settings. Completion evaluates open-ended answer generation, where the system must derive the final value or expression. Choice evaluates multiple-choice solving. For ablations on ambiguity resolution, we additionally report final solver-executed accuracy under three text-diagram evidence regimes: text-complete, where the text-mentioned point set equals the diagram point set; diagram-augmented, where the diagram contains additional points required for grounding; and cross-modal mismatch, where the model must resolve weaker or inconsistent entity evidence.

Baselines.

We compare with three groups of methods that correspond to the failure modes discussed in the introduction. The first group contains general MLLMs prompted to solve or formalize the problem directly, including G-LLaVA (Gao et al., 2025), Vision-R1 (Huang et al., 2025), Qwen2.5-VL (Qwen et al., 2025), QwenVL3-2B (Bai et al., 2025), InternVL (Chen et al., 2024; Zhu et al., 2025), and GPT-4o (OpenAI, 2024). The second group contains neural GPS systems such as NGS (Chen et al., 2022), Geoformer (Khomiakov et al., 2024), SCA-GPS (Ning et al., 2023), GOLD (Zhang and Moshfeghi, 2024), PGPSNet (Zhang et al., 2023), PGPSNet-v2 (Zhang et al., 2024a), and LANS (Li et al., 2024). The third group contains symbolic or neuro-symbolic solvers such as Inter-GPS (Lu et al., 2021), E-GPS (Wu et al., 2024), GeoDRL (Peng et al., 2023), Auto-GPS (Ping et al., 2026), and Pi-GPS (Zhao et al., 2025). Where available, we separately report OCR-based variants (Ping et al., 2026; Zhao et al., 2025), because they expose the brittleness of cascaded perception-to-logic pipelines under fully automatic input processing.

4.1 Results

Table 1 summarizes the results on Geometry3K and PGPS9K. SD-GPS achieves the best accuracy across all four evaluation settings. On Geometry3K, it obtains 86.4% completion accuracy and 90.4% choice accuracy, improving over the strongest listed prior results by 3.5 and 3.2 percentage points, respectively. On PGPS9K, SD-GPS reaches 79.8% completion accuracy and 84.5% choice accuracy, surpassing the strongest listed prior results by 4.4 and 3.0 percentage points. These gains indicate that solver-driven formalization improves not only predicate quality but also final reasoning performance.

The comparison also highlights the limitations of direct MLLM prompting. Although recent MLLMs perform well on general multimodal tasks, their accuracy remains limited on formal geometry reasoning, especially in completion settings where exact executable reasoning is required. The prompted QwenVL3-2B baseline is therefore reported separately from SD-GPS: SD-GPS uses the same backbone but adds solver-driven formal-language adaptation, SG-RL, bounded repair, and verified theorem proposal. In contrast to direct prompting, SD-GPS explicitly converts multimodal inputs into solver-ready predicates and uses execution feedback to penalize incomplete or unusable formalizations. The OCR-based variants of Auto-GPS and Pi-GPS further show that cascaded systems are sensitive to upstream recognition errors, reinforcing the need for integrated cross-modal formalization.

4.2 Ablation Study

Robustness under cross-modal ambiguity.

Table 2 reports final solver-executed accuracy under different relations between text-mentioned points and diagram points. The SFT-only formalizer performs strongly when the textual point set is complete, but its accuracy decreases in the diagram-augmented and cross-modal-mismatch regimes. This suggests that cross-modal ambiguity is not merely a post-processing issue: missing, redundant, or inconsistently grounded entities can directly change the executable predicates provided to the symbolic solver.

Effect of SG-RL and fallback repair.

Introducing SG-RL improves the overall accuracy from 74.9% to 79.5%, indicating that solver-derived feedback provides a useful supervisory signal beyond predicate-level imitation. The improvement is particularly pronounced in the cross-modal-mismatch setting, where accuracy increases from 67.2% to 74.8%. This result suggests that execution feedback helps the model prefer formalizations that are not only syntactically valid but also more compatible with downstream symbolic reasoning. Adding fallback repair further raises the overall accuracy to 84.5%, with a notable gain in the diagram-augmented setting from 77.1% to 84.3%. These results show that bounded execution-aware regeneration can recover from invalid, incomplete, or weakly grounded formalizations.

Effect of theorem proposal.

The final row of Table 2 incorporates the solver-driven theorem-proposing agent, increasing the overall accuracy to 86.4%. After formalization errors are reduced, some remaining failures are caused by the limited coverage of the fixed theorem library rather than by incorrect parsing alone. By observing the solver state and proposing verifiable auxiliary lemmas, the theorem-proposing agent extends the effective reasoning coverage of the symbolic back-end while preserving verification constraints. Importantly, the proposed lemmas are not accepted as unconstrained neural guesses; they must be checked by the solver before contributing to the final derivation.

Effect of task-specific formalization training.

Table 3 compares prompt-based formal-language generation by general-purpose MLLMs with formalizers adapted to the target geometry language. The results show that task-specific adaptation produces more reliable executable formalizations, especially under the syntax, grounding, and predicate constraints imposed by the downstream solver. This supports the need for dedicated formalization training rather than relying solely on prompt-based generation from general-purpose multimodal models.

Table 2: Fine-grained end-to-end accuracy under different text-diagram point relations. Text-complete means the point set mentioned in the problem text equals the diagram point set. Diagram-augmented means the text-mentioned point set is a strict subset of the diagram point set. Cross-modal mismatch covers the remaining cases.

Variant	Text-complete ( $n=142$ )	Diagram-augmented ( $n=262$ )	Cross-modal mismatch ( $n=197$ )	Overall ( $n=601$ )
Prompted MLLM baseline	0.7	0.3	0	0.3
Solver-driven autoformalizer: SFT only	87.2	73.7	67.2	74.9
+ SG-RL solver feedback	90.8	77.1	74.8	79.5
+ bounded fallback repair	94.3	84.3	77.7	84.5
+ theorem-proposing agent	-	-	-	86.4

Table 3: End-to-end solver execution accuracy using model-generated formal representations. All models generate formal representations that are executed by the same symbolic solver. General-purpose MLLMs are evaluated with the same prompt. The SD-GPS rows use QwenVL3-2B as the backbone; the QwenVL3-2B row denotes the prompted base model without solver-driven adaptation.

Model Name	Accuracy (%)
Prompted QwenVL3-2B	0.3
Prompted Qwen2.5-VL-32B	18.1
Prompted InternVL3-78B	19.3
Prompted GPT-4o	22.6
Prompted GLM-4.6V	22.6
Prompted Kimi-K2.5	28.5
Ours (QwenVL3-2B + SFT)	74.9
Ours (QwenVL3-2B + SG-RL)	79.5

5 Limitations

Our evaluation is limited to plane-geometry problems that can be represented with an Inter-GPS-style formal language. Extending the framework to three-dimensional geometry, construction-based problems, or free-form proof generation may require a richer predicate vocabulary and additional symbolic verification rules.

The method also depends on the coverage of the symbolic solver. Solver feedback is useful for training and verification, but solver failure may arise from missing theorem rules or limited search depth rather than incorrect formalization. Similarly, the theorem-proposing agent is constrained by the verifier: its proposed auxiliary relations are accepted only when they pass symbolic checks and help reduce the unresolved goal. The agent should therefore be interpreted as a bounded proposal mechanism for theorem instances and auxiliary relations, not as a replacement for formal proof certification.

6 Conclusion

We presented SD-GPS, a solver-driven framework for verifiable geometry problem solving. Instead of treating autoformalization as a static translation task and theorem prediction as a closed-library selection problem, SD-GPS uses symbolic execution as an active supervisory signal. A unified multimodal formalizer produces solver-ready predicates from raw diagram-text inputs, solver-verified reinforcement learning aligns training with downstream executability, and an iterative theorem-proposing agent expands the solver’s deductive reach through verified lemma proposals. Experiments on Geometry3K and PGPS9K show consistent improvements over MLLM, neural, and neuro-symbolic baselines. These results suggest that closing the loop between perception and symbolic verification is a promising direction for robust multimodal mathematical reasoning. Future work will extend the framework to broader formal languages, stronger proof-certificate generation, and more diverse geometric domains beyond plane-geometry benchmarks.

References

S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y. Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y. Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang, Q. Wang, Y. Wang, T. Xie, Y. Xu, H. Xu, J. Xu, Z. Yang, M. Yang, J. Yang, A. Yang, B. Yu, F. Zhang, H. Zhang, X. Zhang, B. Zheng, H. Zhong, J. Zhou, F. Zhou, J. Zhou, Y. Zhu, and K. Zhu (2025) Qwen3-vl technical report. External Links: 2511.21631, Link Cited by: §3.3, §4, Table 1.
B. Buchberger (1976) A theoretical basis for the reduction of polynomials to canonical forms. 10 (3), pp. 19–29. External Links: ISSN 0163-5824, Link, Document Cited by: §2.
J. Chen, J. Tang, J. Qin, X. Liang, L. Liu, E. P. Xing, and L. Lin (2022) GeoQA: a geometric question answering benchmark towards multimodal numerical reasoning. External Links: 2105.14517, Link Cited by: §1, §4, Table 1.
Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, et al. (2024) Internvl: scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24185–24198. Cited by: §4, Table 1.
J. Cheng, Z. Zhang, R. Chen, J. Deng, Z. Qin, and J. Ma (2025) GeoUni: a unified model for generating geometry diagrams, problems and problem solutions. External Links: 2504.10146, Link Cited by: §2.
Y. Chervonyi, T. H. Trinh, M. Olšák, X. Yang, H. Nguyen, M. Menegali, J. Jung, J. Kim, V. Verma, Q. V. Le, and T. Luong (2025) Gold-medalist performance in solving olympiad geometry with alphageometry2. External Links: 2502.03544, Link Cited by: §1.
DeepSeek-AI (2025) DeepSeek-r1: incentivizing reasoning capability in llms via reinforcement learning. External Links: 2501.12948, Link Cited by: §2.
J. Gao, R. Pi, J. Zhang, J. Ye, W. Zhong, Y. Wang, L. HONG, J. Han, H. Xu, Z. Li, and L. Kong (2025) G-LLaVA: solving geometric problem with multi-modal large language model. In The Thirteenth International Conference on Learning Representations, External Links: Link Cited by: §4, Table 1.
H. Gelernter, J. R. Hansen, and D. W. Loveland (1960) Empirical explorations of the geometry theorem machine. In Papers Presented at the May 3-5, 1960, Western Joint IRE-AIEE-ACM Computer Conference, IRE-AIEE-ACM ’60 (Western), New York, NY, USA, pp. 143–149. External Links: ISBN 9781450378697, Link, Document Cited by: §1.
Y. Hao, M. Zhang, F. Yin, and L. Huang (2022) PGDP5K: a diagram parsing dataset for plane geometry problems. In 2022 26th international conference on pattern recognition (ICPR), pp. 1763–1769. Cited by: §2.
Z. He, Z. Lyu, D. Chen, D. Guo, and Y. R. Fung (2025) MATP-bench: can mllm be a good automated theorem prover for multimodal problems?. External Links: 2506.06034, Link Cited by: §2.
W. Huang, B. Jia, Z. Zhai, S. Cao, Z. Ye, F. Zhao, Z. Xu, Y. Hu, and S. Lin (2025) Vision-r1: incentivizing reasoning capability in multimodal large language models. External Links: 2503.06749, Link Cited by: §4, Table 1.
M. Khomiakov, M. R. Andersen, and J. Frellsen (2024) GeoFormer: a multi-polygon segmentation transformer. External Links: 2411.16616, Link Cited by: §4, Table 1.
Z. Li, M. Zhang, F. Yin, and C. Liu (2024) LANS: a layout-aware neural solver for plane geometry problem. In Findings of the Association for Computational Linguistics: ACL 2024, L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 2596–2608. External Links: Link, Document Cited by: §4, Table 1.
P. Lu, R. Gong, S. Jiang, L. Qiu, S. Huang, X. Liang, and S. Zhu (2021) Inter-gps: interpretable geometry problem solving with formal language and symbolic reasoning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 6774–6786. Cited by: §A.1, §1, §1, §2, §4, §4, §4, Table 1.
M. Ning, Q. Wang, K. Huang, and X. Huang (2023) A symbolic character-aware model for solving geometry problems. External Links: 2308.02823, Link Cited by: §4, Table 1.
OpenAI (2024) GPT-4o system card. External Links: 2410.21276, Link Cited by: §4, Table 1.
S. Peng, D. Fu, Y. Liang, L. Gao, and Z. Tang (2023) Geodrl: a self-learning framework for geometry problem solving using reinforcement learning in deductive reasoning. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 13468–13480. Cited by: §A.1, §1, §2, §2, §4, Table 1.
B. Ping, M. Luo, Z. Dang, C. Wang, and C. Jia (2026) AutoGPS: automated geometry problem solving via multimodal formalization and deductive reasoning. External Links: 2505.23381, Link Cited by: §A.1, §1, §1, §2, §2, §4, §4, Table 1, Table 1.
Qwen, :, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu (2025) Qwen2.5 technical report. External Links: 2412.15115, Link Cited by: §4, Table 1.
M. Sachan and E. Xing (2017) Learning to solve geometry problems from natural language demonstrations in textbooks. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), N. Ide, A. Herbelot, and L. Màrquez (Eds.), Vancouver, Canada, pp. 251–261. External Links: Link, Document Cited by: §1.
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. External Links: 2402.03300, Link Cited by: §2.
K. Team, T. Bai, Y. Bai, Y. Bao, S. H. Cai, Y. Cao, Y. Charles, H. S. Che, C. Chen, G. Chen, H. Chen, J. Chen, J. Chen, J. Chen, J. Chen, K. Chen, L. Chen, R. Chen, X. Chen, Y. Chen, Y. Chen, Y. Chen, Y. Chen, Y. Chen, Y. Chen, Y. Chen, Y. Chen, Z. Chen, Z. Chen, D. Cheng, M. Chu, J. Cui, J. Deng, M. Diao, H. Ding, M. Dong, M. Dong, Y. Dong, Y. Dong, A. Du, C. Du, D. Du, L. Du, Y. Du, Y. Fan, S. Fang, Q. Feng, Y. Feng, G. Fu, K. Fu, H. Gao, T. Gao, Y. Ge, S. Geng, C. Gong, X. Gong, Z. Gongque, Q. Gu, X. Gu, Y. Gu, L. Guan, Y. Guo, X. Hao, W. He, W. He, Y. He, C. Hong, H. Hu, J. Hu, Y. Hu, Z. Hu, K. Huang, R. Huang, W. Huang, Z. Huang, T. Jiang, Z. Jiang, X. Jin, Y. Jing, G. Lai, A. Li, C. Li, C. Li, F. Li, G. Li, G. Li, H. Li, H. Li, J. Li, J. Li, J. Li, L. Li, M. Li, W. Li, W. Li, X. Li, X. Li, Y. Li, Y. Li, Y. Li, Y. Li, Z. Li, Z. Li, W. Liao, J. Lin, X. Lin, Z. Lin, Z. Lin, C. Liu, C. Liu, H. Liu, L. Liu, S. Liu, S. Liu, S. Liu, T. Liu, T. Liu, W. Liu, X. Liu, Y. Liu, Y. Liu, Y. Liu, Y. Liu, Y. Liu, Z. Liu, Z. Liu, E. Lu, H. Lu, Z. Lu, J. Luo, T. Luo, Y. Luo, L. Ma, Y. Ma, S. Mao, Y. Mei, X. Men, F. Meng, Z. Meng, Y. Miao, M. Ni, K. Ouyang, S. Pan, B. Pang, Y. Qian, R. Qin, Z. Qin, J. Qiu, B. Qu, Z. Shang, Y. Shao, T. Shen, Z. Shen, J. Shi, L. Shi, S. Shi, F. Song, P. Song, T. Song, X. Song, H. Su, J. Su, Z. Su, L. Sui, J. Sun, J. Sun, T. Sun, F. Sung, Y. Tai, C. Tang, H. Tang, X. Tang, Z. Tang, J. Tao, S. Teng, C. Tian, P. Tian, A. Wang, B. Wang, C. Wang, C. Wang, C. Wang, D. Wang, D. Wang, D. Wang, F. Wang, H. Wang, H. Wang, H. Wang, H. Wang, H. Wang, J. Wang, J. Wang, J. Wang, K. Wang, L. Wang, Q. Wang, S. Wang, S. Wang, S. Wang, W. Wang, X. Wang, X. Wang, Y. Wang, Y. Wang, Y. Wang, Y. Wang, Y. Wang, Y. Wang, Z. Wang, Z. Wang, Z. Wang, Z. Wang, Z. Wang, Z. Wang, C. Wei, M. Wei, C. Wen, Z. Wen, C. Wu, H. Wu, J. Wu, R. Wu, W. Wu, Y. Wu, Y. Wu, Y. Wu, Z. Wu, C. Xiao, J. Xie, X. Xie, Y. Xie, Y. Xin, B. Xing, B. Xu, J. Xu, J. Xu, J. Xu, L. H. Xu, L. Xu, S. Xu, W. Xu, X. Xu, X. Xu, Y. Xu, Y. Xu, Y. Xu, Z. Xu, Z. Xu, J. Yan, Y. Yan, G. Yang, H. Yang, J. Yang, K. Yang, N. Yang, R. Yang, X. Yang, X. Yang, Y. Yang, Y. Yang, Y. Yang, Z. Yang, Z. Yang, Z. Yang, H. Yao, D. Ye, W. Ye, Z. Ye, B. Yin, C. Yu, L. Yu, T. Yu, T. Yu, E. Yuan, M. Yuan, X. Yuan, Y. Yue, W. Zeng, D. Zha, H. Zhan, D. Zhang, H. Zhang, J. Zhang, P. Zhang, Q. Zhang, R. Zhang, X. Zhang, Y. Zhang, Y. Zhang, Y. Zhang, Y. Zhang, Y. Zhang, Y. Zhang, Y. Zhang, Y. Zhang, Y. Zhang, Y. Zhang, Z. Zhang, C. Zhao, F. Zhao, J. Zhao, S. Zhao, X. Zhao, Y. Zhao, Z. Zhao, H. Zheng, R. Zheng, S. Zheng, T. Zheng, J. Zhong, L. Zhong, W. Zhong, M. Zhou, R. Zhou, X. Zhou, Z. Zhou, J. Zhu, L. Zhu, X. Zhu, Y. Zhu, Z. Zhu, J. Zhuang, W. Zhuang, Y. Zou, and X. Zu (2026a) Kimi k2.5: visual agentic intelligence. External Links: 2602.02276, Link Cited by: Table 1.
V. Team, W. Hong, W. Yu, X. Gu, G. Wang, G. Gan, H. Tang, J. Cheng, J. Qi, J. Ji, L. Pan, S. Duan, W. Wang, Y. Wang, Y. Cheng, Z. He, Z. Su, Z. Yang, Z. Pan, A. Zeng, B. Wang, B. Chen, B. Shi, C. Pang, C. Zhang, D. Yin, F. Yang, G. Chen, H. Li, J. Zhu, J. Chen, J. Xu, J. Xu, J. Chen, J. Lin, J. Chen, J. Wang, J. Chen, L. Lei, L. Gong, L. Pan, M. Liu, M. Xu, M. Zhang, Q. Zheng, R. Lyu, S. Tu, S. Yang, S. Meng, S. Zhong, S. Huang, S. Zhao, S. Xue, T. Zhang, T. Luo, T. Hao, T. Tong, W. Jia, W. Li, X. Liu, X. Zhang, X. Lyu, X. Zhang, X. Fan, X. Huang, Y. Xue, Y. Wang, Y. Wang, Y. Wang, Y. An, Y. Du, Y. Huang, Y. Niu, Y. Shi, Y. Wang, Y. Wang, Y. Yue, Y. Li, Y. Liu, Y. Zhang, Y. Wang, Y. Zhang, Z. Xue, Z. Du, Z. Hou, Z. Wang, P. Zhang, D. Liu, B. Xu, J. Li, M. Huang, Y. Dong, and J. Tang (2026b) GLM-4.5v and glm-4.1v-thinking: towards versatile multimodal reasoning with scalable reinforcement learning. External Links: 2507.01006, Link Cited by: Table 1.
T. Trinh, Y. Wu, Q. Le, H. He, and T. Luong (2024) Solving olympiad geometry without human demonstrations. Nature. External Links: Document Cited by: §1.
W. Wu (1978) On the decision problem and the mechanization of theorem-proving in elementary geometry. 21 (2), pp. 159–172. Cited by: §2.
W. Wu (1986) Basic principles of mechanical theorem proving in elementary geometrics. 2 (3), pp. 221–252. External Links: ISSN 0168-7433, Link, Document Cited by: §1.
W. Wu, L. Zhang, J. Liu, X. Tang, Y. Wang, S. Wang, and Q. Wang (2024) E-gps: explainable geometry problem solving via top-down solver and bottom-up generator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13828–13837. Cited by: §A.1, §1, §2, §4, Table 1.
Y. Wu, A. Q. Jiang, W. Li, M. N. Rabe, C. Staats, M. Jamnik, and C. Szegedy (2022a) Autoformalization with large language models. External Links: 2205.12615, Link Cited by: §1.
Y. Wu, A. Q. Jiang, W. Li, M. Rabe, C. Staats, M. Jamnik, and C. Szegedy (2022b) Autoformalization with large language models. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35, pp. 32353–32368. External Links: Link Cited by: §2.
J. Xiong, Q. Han, Y. Hsieh, H. Shen, H. Xin, C. Tao, C. Zhao, H. Zhang, T. Wu, Z. Zhang, H. Wang, Z. Wan, L. Kong, and N. Wong (2026) MMFormalizer: multimodal autoformalization in the wild. External Links: 2601.03017, Link Cited by: §2.
J. Zhang and Y. Moshfeghi (2024) GOLD: geometry problem solver with natural language description. External Links: 2405.00494, Link Cited by: §4, Table 1.
M. Zhang, Z. Li, F. Yin, L. Lin, and C. Liu (2024a) Fuse, reason and verify: geometry problem solving with parsed clauses from diagram. External Links: 2407.07327, Link Cited by: §4, Table 1.
M. Zhang, F. Yin, and C. Liu (2023) A multi-modal neural geometric solver with textual clauses parsed from diagram. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, pp. 3374–3382. Cited by: §1, §4, §4, §4, Table 1.
X. Zhang, N. Zhu, Y. He, J. Zou, Q. Huang, X. Jin, Y. Guo, C. Mao, Y. Li, Z. Zhu, D. Yue, F. Zhu, Y. Wang, Y. Huang, R. Wang, C. Qin, Z. Zeng, S. Xie, X. Luo, and T. Leng (2024b) FormalGeo: an extensible formalized framework for olympiad geometric problem solving. External Links: 2310.18021, Link Cited by: §1.
Z. Zhang, J. Cheng, J. Deng, L. Tian, J. Ma, Z. Qin, X. Zhang, N. Zhu, and T. Leng (2024c) Diagram formalization enhanced multi-modal geometry problem solver. External Links: 2409.04214, Link Cited by: §2.
J. Zhao, T. Zhang, J. Sun, M. Tian, and H. Huang (2025) Pi-gps: enhancing geometry problem solving by unleashing the power of diagrammatic information. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1526–1536. Cited by: §1, §2, §2, §4, §4, Table 1, Table 1.
C. Zheng, S. Liu, M. Li, X. Chen, B. Yu, C. Gao, K. Dang, Y. Liu, R. Men, A. Yang, J. Zhou, and J. Lin (2025) Group sequence policy optimization. External Links: 2507.18071, Link Cited by: §2.
J. Zhu, W. Wang, Z. Chen, Z. Liu, S. Ye, L. Gu, H. Tian, Y. Duan, W. Su, J. Shao, Z. Gao, E. Cui, X. Wang, Y. Cao, Y. Liu, X. Wei, H. Zhang, H. Wang, W. Xu, H. Li, J. Wang, N. Deng, S. Li, Y. He, T. Jiang, J. Luo, Y. Wang, C. He, B. Shi, X. Zhang, W. Shao, J. He, Y. Xiong, W. Qu, P. Sun, P. Jiao, H. Lv, L. Wu, K. Zhang, H. Deng, J. Ge, K. Chen, L. Wang, M. Dou, L. Lu, X. Zhu, T. Lu, D. Lin, Y. Qiao, J. Dai, and W. Wang (2025) InternVL3: exploring advanced training and test-time recipes for open-source multimodal models. External Links: 2504.10479, Link Cited by: §4, Table 1.

Appendix A Technical appendices and supplementary material

A.1 Experimental and implementation details

The symbolic method Inter-GPS [Lu et al., 2021] and Auto-GPS [Ping et al., 2026] were reproduced using the authors’ open-source code, while results for GeoDRL [Peng et al., 2023] and E-GPS [Wu et al., 2024] were extracted from original publications due to code unavailability. All experiments were executed on an Intel Xeon Gold 6530 CPU platform in conjunction with two NVIDIA L40S GPUs. The trainable multimodal formalizer is initialized from QwenVL3-2B; the prompted QwenVL3-2B rows use the same base model without task-specific adaptation. A strict timeout threshold of 600 seconds was imposed on symbolic solvers, where any computation exceeding this duration was systematically categorized as resolution failure. For MLLM hyperparameters, we set max_tokens=4096, temperature=0.1, and top-p=1.0.

Backbone.

The SD-GPS formalizer is initialized from QwenVL3-2B, and all reported SD-GPS variants use this 2B multimodal backbone unless otherwise specified. The theorem-proposing agent uses Qwen3.6-Plus. The QwenVL3-2B entries in the baseline tables denote the prompted base model without solver-driven adaptation.

Input normalization.

For each example, the diagram is resized while preserving aspect ratio and is paired with the raw problem statement. We do not expose OCR tokens, detected bounding boxes, or manually curated point sets to SD-GPS during inference. Text is normalized only for Unicode variants of mathematical symbols and whitespace; numerical values, point labels, and algebraic expressions are otherwise kept unchanged to avoid leaking solver-side corrections into the formalizer.

Formal representation.

We adopt an InterGPS-style typed formal language to represent geometric problems in a solver-executable form. The formalizer emits a structured sequence of typed predicates, explicitly separating entity declarations, relational constraints, and query targets. Each problem instance is represented as a directed set of well-typed logical atoms over geometric primitives. This InterGPS-style representation enforces strict type consistency between geometric objects and operators, preventing invalid compositions such as applying length functions to angles or mixing segment-level and point-level relations. Before being passed to the solver, a lightweight structural parser performs validation of: arity constraints, type compatibility, duplicate entity declarations, illegal symbol usage, and unresolved references.

A.2 Inference-time verification and repair

At inference time, SD-GPS uses a bounded verify-repair loop. The first pass generates a candidate formalization. The parser and solver then return structured diagnostics, including missing entity references, ill-typed predicates, contradictory constraints, unsolved targets, and timeout states. If the candidate fails before a valid proof state is constructed, the model is prompted to repair only the formal predicates while preserving the original problem image and text. We allow at most two repair attempts per example. The final answer is accepted only when it is produced from a solver-executable formalization; otherwise the example is marked unresolved. This policy keeps the reported accuracy tied to verifiable execution rather than unconstrained natural-language rationales.

A.3 Three-way split of point-reference ambiguity

To analyze how much information is explicitly available from the problem text, we divide examples according to the relation between text-mentioned point names and diagram point instances. We first extract the point set $P_{\mathrm{text}}$ that is explicitly mentioned in the problem statement. The extractor is constrained by the diagram point vocabulary $P_{\mathrm{diag}}$ , so it can only return valid diagram point names and cannot infer unmentioned points from geometry or common sense. The split is then defined as follows:

•

Text-complete: $P_{\mathrm{text}}=P_{\mathrm{diag}}$ . The text explicitly names all diagram points needed by the visual instance.
•

Diagram-augmented: $P_{\mathrm{text}}\subset P_{\mathrm{diag}}$ . The text names only a strict subset of diagram points, so the diagram supplies additional entities required for formalization.
•

Cross-modal mismatch: all remaining cases, including examples where the text refers to an underspecified figure, region, or target whose executable meaning must be resolved from the diagram and the formal constraints together.

Figure 3 shows one representative example from each split. These examples are used only for qualitative illustration; the quantitative ablation in Table 2 is computed over the full test split.

A.4 Additional theorem-proposing agent details

The theorem-proposing agent is invoked only after the formalization is executable but the solver cannot close the proof within the timeout or theorem-search budget. The agent observes four signals: the normalized predicate set, the current solver agenda, the unresolved target, and the latest failure diagnostic. We serialize this information as a compact proof-state summary rather than a full search tree, which keeps the prompt short and prevents the model from relying on irrelevant derivations.

Each proposal follows a restricted schema:

\tau=(\mathrm{type},\mathrm{premises},\mathrm{conclusion},\mathrm{entities},\mathrm{rationale}).

(10)

The type field specifies whether the proposal is, for example, a similarity relation, cyclicity relation, equal-angle relation, proportionality relation, or auxiliary construction. The entity field must use only point, line, circle, or expression identifiers already declared in $\hat{\Lambda}$ , unless the proposal type explicitly permits a bounded auxiliary construction. This schema makes the proposal easy to parse and prevents free-form natural-language suggestions from entering the solver.

A proposal is accepted only if it passes four checks: well-typedness under the formal language, applicability to the current proof state, symbolic verification of the proposed conclusion from admissible premises, and goal reduction under the solver’s search procedure. Unverified proposals are discarded and do not affect the final answer. Verified proposals are treated as temporary lemmas scoped to the current problem instance rather than permanent additions to the global theorem library, and the verifier stores the corresponding local certificate or construction trace. This design allows the system to bridge instance-specific deductive gaps while preserving the solver’s role as the authority for correctness.

A.5 OCR failure modes in cascaded pipelines

The gap between fully automatic OCR-based pipelines and their non-OCR counterparts is not only a matter of small transcription noise. In geometry diagrams, labels are often tiny, slanted, partially occluded by arrows or tick marks, and visually entangled with nearby symbolic marks. Under these conditions, OCR errors frequently alter the symbolic identity of an entity or the algebraic form of a constraint, which then propagates directly into the formal language used by the downstream solver.

Figure 4 shows representative OCR failures on geometry diagram crops. The first two cases illustrate expression-level corruption: in panel (a), the length expression x + 1 is read as x1, effectively deleting the operator; in panel (b), 3x - 4 is misread as 3x’ - 4, injecting an extra stroke into the symbolic item. The remaining four cases illustrate label-level confusion. Lowercase letters can be mapped to digits or visually similar glyphs, e.g., b $\rightarrow$ 6 in panel (c), q $\rightarrow$ 9 in panel (d), the point label I $\rightarrow$ / in panel (e), and the line label l $\rightarrow$ e in panel (f).

These failures are especially damaging in cascaded geometry systems because OCR output is typically treated as an authoritative symbol table for subsequent parsing and reasoning. Once a segment length is written as x1 instead of x + 1, or a line label is changed from l to e, downstream modules no longer operate on a slightly noisy observation; they operate on a different formal problem. The symbolic solver can verify consistency only with respect to the corrupted predicates it receives, so it has very limited ability to recover from these front-end mistakes. This explains why OCR-based variants in the main results degrade much more sharply than end-to-end approaches that avoid explicit OCR as an intermediate bottleneck.