License: arXiv.org perpetual non-exclusive license
arXiv:2606.21604v1 [cs.LG] 19 Jun 2026

Learning to Place Guards by Reinforcement:
A Geo-Free Neural Policy for the Vertex-Guard
Art Gallery Problem

Domagoj Ševerdija dseverdi@mathos.hr Jurica Maltar jmaltar@mathos.hr Nathan Chappel nathan.s.chappel@gmail.com Domagoj Matijević domagoj@mathos.hr School of Applied Mathematics and Informatics, University J. J. Strossmayer of Osijek, Croatia
Abstract

Neural combinatorial optimization (NCO) has shown that policies trained by reinforcement can construct strong solutions to NP-hard problems directly from raw instances. What such a policy actually learns, as opposed to what its decoder expresses, remains much less clear. We study this distinction on the vertex-guard Art Gallery Problem, the NP-hard task of choosing polygon vertices from which to observe an entire region. A pointer-network policy is trained from a coverage-aware reward over its own rollouts under the constraint we call geo-free inference: at test time it sees only vertex coordinates, with no visibility computation and no geometric oracle. The policy places guards economically but leaves a tail of under-covered polygons that widens far beyond the training range. To locate the cause, we freeze the trained encoder and read its embeddings with a small single-shot classifier, still geo-free at inference. The classifier closes most of the feasibility gap, in and out of distribution and at up to roughly five times the training range, cutting under-covered polygons by about an order of magnitude at an explicitly reported cost in guard count. We read this as evidence that the reinforcement-trained representation already encodes the geometry required for feasibility, and that residual failures reflect decoder calibration rather than missing knowledge. Probing a frozen encoder thus offers a practical way to ask what a neural combinatorial solver has internalized.

keywords:
Art Gallery Problem , neural combinatorial optimization , reinforcement learning , pointer networks , representation probing

1 Introduction

The Art Gallery Problem (AGP) asks for the minimum number of guards needed to observe every point of a simple polygon [22, 28], with applications in surveillance, sensor placement, and robotics [13, 6]. We focus on the vertex-guard variant, in which guards must be chosen from the polygon’s vertices; the problem remains NP-hard and is a discrete instance of geometric set cover with non-uniform, polygon-dependent visibility regions. The question this paper investigates is whether a neural policy can learn to place such guards from coordinates alone, trained from a coverage-aware reward signal over its own rollouts, with no visibility matrix and no geometric oracle at the moment it chooses guards. We are not asking whether a learned solver can match a classical greedy heuristic on guard count — it cannot, and we do not claim it does. We are asking whether the underlying combinatorial-geometric task is learnable when the model is denied any explicit geometric input at test time.

The reinforcement method we evaluate is an LSTM pointer-network policy trained with Preference Optimization (PO) [29] on Bradley–Terry preference pairs [7] drawn from the policy’s own rollouts. The policy emits a vertex-index sequence with an EOS\mathrm{EOS} token; the reward couples coverage and guard usage. PO replaces REINFORCE advantages [33, 3] with binary preferences between pairs of policy rollouts, preserving a usable gradient signal even when the reward saturates under the coverage constraint [29]. We refer to inference under this policy as geo-free — the network reads only vertex coordinates at the moment it places guards, never querying a visibility oracle (defined precisely in Section 2.3).

The policy works, to a degree. On a held-out in-distribution split its guard sets are competitive in cardinality with classical baselines, but the per-polygon coverage distribution has a tail: a non-trivial fraction of polygons end up below the feasibility line one would want a deployed solver to clear, and the failure rate grows on out-of-distribution polygons whose vertex counts exceed the training range. The reinforcement method, by itself, is not feasibility-guaranteed.

To check whether this residual reflects a true limit of the reinforcement-learned representation or only a calibration problem in the decoder, we train a small per-vertex classifier conditioned on the policy’s frozen encoder embeddings — together with the vertex coordinates and a binary indicator of the policy’s own guard set, all geo-free — and predict a vertex-inclusion probability, controlled by a single scalar threshold. The classifier has no access to visibility information at inference, and it compresses the feasibility tail substantially both in and out of distribution. A no-encoder ablation that zeroes the embeddings while keeping the coordinates and seed indicator (Section 6.2) isolates the encoder’s contribution from these other inputs, and we read the gap as evidence that the reinforcement-trained encoder already carries enough geometric structure to support a feasibility decision — so the policy’s coverage tail reflects what its decoder expressed, not a limit of the representation. The cost of recovering feasibility is a controlled rise in guard count, which we report explicitly.

Contributions

This paper makes four contributions.

  • C1

    A representation–decoder decomposition for neural combinatorial optimization. We adapt the probing methodology of representation analysis [1, 16, 2] to a reinforcement-trained combinatorial solver: freezing the policy’s encoder and training a single-shot per-vertex probe over its embeddings (with the vertex coordinates and the policy’s seed) isolates what the learner represents from what its decoder expresses, with a 2×22\times 2 encoder–seed ablation and a zero-capacity linear probe as controls. To our knowledge this is the first such probing study of an RL-trained policy on a geometric covering problem.

  • C2

    Out-of-distribution generalization at scale. On a large held-out set spanning the training size range and extending to roughly five times beyond it, the probe raises the average fraction meeting the 0.950.95 feasibility gate across four probe seeds from 85%85\% (policy seed) to 99%99\%; the gain holds on the genuinely larger-than-training polygons alone (85%98.7%85\%\to 98.7\%, Section 6.3).

  • C3

    Geo-free inference as a protocol. Restricting test-time inputs to vertex coordinates and learned features derived from them — never a visibility oracle — isolates what a learner internalizes from what such an oracle could compute for it, making “no explicit geometric knowledge” a measurable property rather than a rhetorical one (Section 2.3).

  • C4

    A structural finding on post-processing. Learned iterative editors fail to improve the policy’s seed without dropping coverage — due to error accumulation, stop-time miscalibration, and train/inference drift — whereas the single-shot probe avoids all three by design and is empirically a fixed point: re-running it on its own output changes nothing (Section 6.4). That contrast validates the single-shot design as a clean reading of the encoder.

We do not claim to beat classical greedy or local-search heuristics on guard count; we claim, and document, that vertex-guard placement is learnable to a measurable degree by a reinforcement method that never reads an explicit visibility object at inference.

The remainder of the paper is organized as follows. Section 2 establishes notations and definitions. In Section 3 related work is reviewed. Section 4 describes the reinforcement method and the representation probe. Finally, Section 5 describes the experimental setup, Section 6 reports the results, followed by discussion in Section 7, limitations in Section 8, and the conclusion in Section 9.

2 Problem and Notation

2.1 Polygons and visibility

Let P2P\subset\mathbb{R}^{2} denote a simple polygonal region (the closed set consisting of its interior and boundary), with nn vertices at coordinates x1,,xn2x_{1},\dots,x_{n}\in\mathbb{R}^{2} and vertex index set V(P)={1,,n}V(P)=\{1,\dots,n\}. For two points p,qPp,q\in P we say that pp sees qq when the closed segment between them stays inside the polygon, pq¯P\overline{pq}\subseteq P. Guards are placed at vertices: for a vertex-guard set SV(P)S\subseteq V(P), the visibility region and area-normalized coverage are

VisP(S)={qP:iS,xi sees q},CovP(S)=Area(VisP(S))Area(P)[0,1].\mathrm{Vis}_{P}(S)\,=\,\{\,q\in P:\exists\,i\in S,\ x_{i}\text{ sees }q\,\},\qquad\mathrm{Cov}_{P}(S)\,=\,\frac{\mathrm{Area}(\mathrm{Vis}_{P}(S))}{\mathrm{Area}(P)}\in[0,1]. (1)

2.2 The vertex-guard variant

The vertex-guard Art Gallery Problem (AGPVG) asks for the smallest vertex-guard set that covers the whole polygon. We define the vertex-guard optimum directly:

OPT(P)=min{|S|:SV(P),CovP(S)=1}.\mathrm{OPT}(P)\,=\,\min\bigl\{|S|:S\subseteq V(P),\ \mathrm{Cov}_{P}(S)=1\bigr\}. (2)

Restricting guards to V(P)V(P) reduces placement to a finite discrete choice over nn vertices — the natural action space for a pointer-network policy [32, 3, 20, 21] — while retaining the full NP-hardness and non-uniform visibility structure of the problem.

Note that equation (2) is a constrained single-objective problem — minimize the guard count subject to the hard constraint CovP(S)=1\mathrm{Cov}_{P}(S)=1 — and is a discrete set-cover instance, where coverage is the submodular union of per-guard visibility regions. This is what separates it from the routing problems on which neural combinatorial solvers are usually studied, where any permutation is feasible and only a single scalar cost is minimized. A learner denied a visibility oracle at inference (the geo-free constraint, Section 2.3) cannot certify the coverage constraint while placing guards; it must instead trade coverage against guard count, so the single constrained objective becomes a two-axis operating curve.

2.3 Geo-free inference

We will say that an inference procedure is geo-free if its inputs at test time are limited to the polygon’s vertex coordinates and learned features derived from them — no polygon visibility matrix, no CGAL-computed visibility regions, no per-vertex area or visibility-fraction feature. Geo-free inference does not preclude using exact visibility during evaluation (to report coverage after the fact) or during training (to compute rewards, gate trajectories, or build supervised targets); it restricts what the model reads at the moment it places guards. A solver allowed to query a visibility oracle at every decision step — as classical local search and active-search decoding both do — can in principle simulate greedy or local search, and a measurement of what such a solver has learned would be conflated with a measurement of what its oracle can compute. The geo-free constraint isolates the first question from the second, which is what makes “no explicit geometric knowledge” a measurable property.

3 Related Work

This section places the present work in four lines of research: algorithms for the Art Gallery Problem (classical and learned), neural combinatorial optimization with reinforcement learning, supervised post-processing of learned solvers, and probing of learned representations.

The Art Gallery Problem

Lee and Lin [22] established NP-hardness for several variants of the AGP; O’Rourke [28] surveys the foundational geometric and combinatorial results. Work on the problem has traditionally followed three lines. Exact algorithms solve smaller instances to optimality, most effectively via integer programming: Couto et al. [10] compute exact vertex-guard optima this way, and we treat their published instance library [9], with its exact optima, as the source of polygons in our experiments. Approximation algorithms provide provable guarantees — Ghosh [14] gives a logarithmic-factor method for guarding simple polygons. Heuristics without theoretical guarantees, of which the classical greedy rule [15] that places guards to maximize marginal coverage is the most common, are reliable on small to mid-sized polygons. Because area coverage is a monotone submodular function of the guard set, this marginal-coverage rule is exactly the greedy submodular-maximization heuristic and inherits its diminishing-returns approximation guarantee [27]; the guarantee is available only to a solver that can evaluate marginal coverage while it selects, which a geo-free learner cannot (Section 2.3), so we forgo such guarantees by construction — the price of measuring what the representation has learned rather than what an oracle computes. The same exact/approximation/heuristic split recurs across the problem’s many variants — for instance, guarding 1.51.5-dimensional terrains admits constant-factor approximations [4, 12]. Local-search refinement of a candidate guard set by swapping or removing vertices is the standard high-quality post-processing step. We use classical greedy purely as a non-learned evaluation reference, not in competition with the reinforcement method on guard count.

Reinforcement learning has been applied to the AGP and to closely related coverage problems before, but in settings quite different from ours. Liao et al. [24] discretize the polygon into a grid and train a per-instance tabular Q-learning agent that trades guard count against covered area; Kaba et al. [19] cast the 3D view-planning problem — a sensor-coverage set-cover — as an MDP solved with value-based RL (SARSA, Q-learning, temporal-difference) guided by a geometry-aware score, outperforming the greedy baseline. Both methods consult coverage or visibility while placing guards and learn a fresh policy for each instance. We differ on three axes: the action space is the polygon’s own vertices rather than grid cells or candidate viewpoints; a single amortized policy generalizes to unseen polygons without retraining; and inference is geo-free, with no visibility computation at the moment guards are placed.

Reinforcement learning for combinatorial optimization

Pointer networks [32] introduced the architecture of attending over input tokens to produce variable-length output index sequences, and were extended to combinatorial optimization with reinforcement learning by Bello et al. [3] using REINFORCE [33] with a learned baseline; Deudon et al. [11] similarly train TSP construction heuristics by policy gradient. Subsequent work introduced attention-only architectures for routing problems [20], policy-optimization variants such as POMO [21] that exploit problem symmetries, and Transformer pointer policies such as Pointerformer [18] aimed at generalizing to large instances — a concern we share, since our evaluation stresses out-of-distribution polygon sizes. We adopt the more recent preference-optimization (PO) recipe of Pan et al. [29], which trains the policy with a Bradley–Terry [7] ranking loss over pairs of solutions sampled from the policy itself, replacing the variance-heavy REINFORCE objective. We treat PO as a reinforcement-style procedure: the loss is computed from policy rollouts on the environment (the polygon), the gradient direction is determined by the reward (coverage versus guard usage), and no supervised labels for the action sequence are used. A broader methodological survey is provided by Bengio et al. [5].

A recurring feature of this literature is worth making explicit, because AGP departs from it. Routing problems such as the TSP carry a single scalar objective and free feasibility — any permutation is a valid tour — and even constrained variants such as capacitated routing usually enforce feasibility by masking infeasible actions at decode time, which presumes an oracle for the constraint. Vertex-guard AGP has neither property: it is a constrained set-cover problem (Section 2.2), and under geo-free inference the coverage constraint cannot be masked in, so it can only be learned and traded off against guard count — the operating curve our threshold sweep traces. The closest learned analogues are RL methods that approximate Pareto frontiers for bi-objective routing [26, 23]; but those balance two soft costs, whereas one of our two axes is a hard feasibility constraint, making PA-Net’s Lagrangian relaxation of a constraint [26] a closer match than a symmetric Pareto front. On the optimization side, Caramanis et al. [8] prove that policy-gradient solution-samplers enjoy benign landscapes for a broad class of problems (Max-/Min-Cut, Max-kk-CSP, bipartite matching, TSP), but their guarantee assumes a single cost that is bilinear in instance and solution features and imposes no hard feasibility constraint. AGP’s guard-count term is linear and would fit, but its coverage term is the submodular union of visibility regions and its full-coverage requirement is a hard, geo-free-uncertifiable constraint — placing AGP outside their analyzed class and helping explain why a naive policy gradient struggles here (Section 4.1).

Supervised post-processing of learned solvers

An orthogonal line of work post-processes a base solver’s output with a second model trained by supervised or imitation learning. For combinatorial problems with hard feasibility constraints (such as covering), iterative supervised editors must contend with compounding errors and a train–inference distribution gap; DAgger [30] and its variants partially address the latter. We use a single-shot supervised post-processor (the SetPredictor) not as a competing solver but as a probe of the RL-trained encoder’s representation. The framing is methodological: if a small classifier reading frozen RL embeddings can recover feasibility, the RL encoder must already carry the relevant geometric information. Directly cloning search solutions with a recurrent decoder, by contrast, fails on this problem for a structural reason — the teacher-forcing cascade we return to in Section 6.4 — which is why we read the frozen encoder with a single-shot, recurrence-free probe rather than a sequential supervised decoder.

Probing learned representations

The methodology of freezing a trained network and reading its internal representations with a small supervised classifier originates in the analysis of vision and language models: linear classifier probes diagnose what intermediate layers encode [1], control tasks calibrate how much of a probe’s accuracy reflects the representation rather than the probe’s own capacity [16], and Belinkov [2] surveys the family. We adapt this methodology to neural combinatorial optimization, where it has seen little use: the probed model is a reinforcement-trained policy rather than a self-supervised encoder, the probed property is a geometric feasibility signal (guard membership) rather than linguistic structure, and the geo-free constraint fixes what the probe may read at inference. Our no-encoder ablation (Section 6.2) plays the role of the control task: it keeps the probe’s capacity and training recipe constant while removing the representation, so the performance gap is attributable to the encoder.

4 Method

We describe the reinforcement method (Section 4.1) and the single-shot representation probe (Section 4.2). The empirical diagnostic that most motivates the single-shot design, i.e., why iterative editors fail to improve the seed, is deferred to the results (Section 6.4).

4.1 The reinforcement method: PO/BT pointer policy

Architecture

The policy is a compact pointer network with a single-layer, unidirectional LSTM [17] encoder and an attention-based decoder. A polygon PP with vertex coordinates (x1,,xn)(x_{1},\dots,x_{n}) is consumed coordinate-wise (no visibility input). Running the LSTM over the vertex sequence yields per-vertex hidden states that serve as the embeddings (h1,,hn)(h_{1},\dots,h_{n}), hidh_{i}\in\mathbb{R}^{d}, where dd is the encoder embedding dimension (its value, and those of all dimensions introduced below, is set in Section 5.2). The decoder attends over the encoder at each step, emits a vertex index or an end-of-sequence - EOS\mathrm{EOS} token, and stops at EOS\mathrm{EOS} or when the maximum length is reached. The decoder outputs a sequence π=(π1,,π|π|)\pi=(\pi_{1},\dots,\pi_{|\pi|}) while the resulting guard set is S(π)={πt:πtEOS}S(\pi)=\{\pi_{t}:\pi_{t}\neq\mathrm{EOS}\}. All metrics, i.e., coverage, |S|/n|S|/n, |S|/OPT|S|/\mathrm{OPT}, are computed from S(π)S(\pi). We denote the joint encoder–decoder distribution by pθ(πP)p_{\theta}(\pi\mid P), where θ\theta collects all policy parameters.

The encoder has no training signal of its own: its weights change only because the reward gradient back-propagates from the decoder’s rollouts. If the encoder already holds the geometric information needed for feasibility, a small classifier reading its frozen embeddings should be able to recover the guard set the decoder failed to express. This is why we probe the encoder inside the trained policy by freezing it and attaching a small classifier. By contrast, a standalone model, i.e., an encoder + a probe trained from scratch, would reflect its own learning, not what the reinforcement policy internalized. The policy’s greedy decode additionally supplies the seed guard set that the representation probe (Section 4.2) later refines.

Why an autoregressive pointer

Vertex-guard placement is a set-cover problem: whether a vertex should be a guard depends on what the guards already chosen leave uncovered. An autoregressive pointer captures this dependency naturally by conditioning each pick on the partial solution (Eq. (5)), and it decides cardinality through the EOS\mathrm{EOS} token rather than a fixed threshold. A model scoring vertices independently would miss this structure — which is also why the probe (Section 4.2) is not a standalone solver: unable to be autoregressive, it takes the policy’s greedy seed as an input instead.

The reason the decoder is necessary once the probe works, i.e., why we cannot discard it, is that the encoder’s representation was shaped entirely by the reward gradient flowing through the decoder, and the probe reads the decoder’s seed as a feature. Removing the decoder removes both the representation and the probe’s input. What we compare is encoder-and-decoder against encoder-and-probe, not the decoder against no-decoder. Reading the frozen encoder with a probe also recovers feasibility on polygons far larger than those seen in training (Section 6.3), which autoregressive pointer policies alone cannot do [32, 18].

Reward and rollouts

For each polygon PP we sample RR rollouts from the policy and score each rollout’s guard set S(π)S(\pi) by a scalar utility that gates coverage at τ\tau and then prefers smaller guard sets,

u(πP)=min(CovP(S(π)),τ)λ|S(π)||V(P)|ρmax(0,τCovP(S(π))),u(\pi\mid P)\,=\,\min\!\big(\mathrm{Cov}_{P}(S(\pi)),\,\tau\big)\;-\;\lambda\,\frac{|S(\pi)|}{|V(P)|}\;-\;\rho\,\max\!\big(0,\,\tau-\mathrm{Cov}_{P}(S(\pi))\big), (3)

where τ=0.99\tau=0.99 is the coverage gate, λ=1.0\lambda=1.0 weights guard usage, and ρ=3.0\rho=3.0 penalizes coverage below the gate (Section 5.2). Capping the coverage term at τ\tau removes any incentive to over-cover, so once a rollout is feasible the utility is driven purely by cardinality. Coverage during training is computed with the discretized-visibility sampler described next (M=500M=500 interior points per polygon), a cheap approximation of the exact coverage area.

Discretized-visibility coverage

Computing the exact coverage of a guard set — the area of the union of its vertices’ visibility polygons — is too costly to repeat for the many rollouts and local-search candidates scored per polygon, so during training we estimate it by Monte-Carlo sampling. For each polygon we draw M=500M=500 points uniformly from its interior (rejection sampling within the bounding box, under a fixed per-polygon seed) and, for each vertex ii, compute its exact visibility polygon VisP({i})\mathrm{Vis}_{P}(\{i\}) once with CGAL triangular-expansion visibility [31]. Recording which samples fall inside each visibility polygon yields a boolean visibility matrix B{0,1}|V(P)|×MB\in\{0,1\}^{|V(P)|\times M} with Bis=1B_{is}=1 iff sample ss is visible from vertex ii, and the coverage of a guard set SS is estimated by the covered-sample fraction

CovP(S)¯=1M|{s:iS,Bis=1}|,\overline{\mathrm{Cov}_{P}(S)}\,=\,\frac{1}{M}\,\Big|\big\{\,s:\exists\,i\in S,\ B_{is}=1\,\big\}\Big|, (4)

a uniform Monte-Carlo estimate of the area ratio CovP(S)\mathrm{Cov}_{P}(S) of Eq. (1), evaluated as a bitwise OR over the rows of BB. Because BB depends only on the polygon, it is computed once and cached, so every subsequent coverage query during a rollout or a local-search edit costs an O(|S|M)O(|S|\,M) bitwise operation rather than a polygon-union computation. The per-vertex visibility polygons are exact; the only approximation is replacing the area integral by a sample average, whose standard error decays as O(1/M)O(1/\sqrt{M}). Exact coverage — the area of the union, integrated rather than sampled — is reserved for evaluation (Section 5.3).

Preference-optimization loss

We score a rollout by its length-normalized log-likelihood — the mean per-token log-probability under the factorization

pθ(πP)=tpθ(πtπ<t,P)p_{\theta}(\pi\mid P)=\prod_{t}p_{\theta}(\pi_{t}\mid\pi_{<t},P)

,

¯θ(πP)=1|π|t=1|π|logpθ(πtπ<t,P),\bar{\ell}_{\theta}(\pi\mid P)\,=\,\frac{1}{|\pi|}\sum_{t=1}^{|\pi|}\log p_{\theta}(\pi_{t}\mid\pi_{<t},P), (5)

which keeps the preference from being biased by guard-set size, since AGP solutions have variable length (cf. Pan et al. [29]). For each pair (π+,π)(\pi^{+},\pi^{-}) of rollouts with u(π+P)>u(πP)u(\pi^{+}\mid P)>u(\pi^{-}\mid P) above a coverage gate, we apply the Bradley–Terry log-likelihood

BT(θ)=𝔼(P,π+,π)[logσ(α[¯θ(π+P)¯θ(πP)])],\mathcal{L}_{\mathrm{BT}}(\theta)\,=\,-\mathbb{E}_{(P,\pi^{+},\pi^{-})}\left[\,\log\sigma\!\big(\alpha\,[\,\bar{\ell}_{\theta}(\pi^{+}\mid P)-\bar{\ell}_{\theta}(\pi^{-}\mid P)\,]\big)\,\right], (6)

where α>0\alpha>0 is a temperature scaling the preference margin (α=0.05\alpha=0.05; Section 5.2). The gradient direction is driven purely by which of the two rollouts achieved a higher reward; no supervised target sequence is used.

The choice of PO over REINFORCE is deliberate. Once a rollout meets the coverage constraint, the reward of Eq. (3) flattens and the policy-gradient signal on guard count vanishes — a saturation effect specific to hard coverage constraints, not covered by benign-landscape guarantees for unconstrained problems [8]. We observe this failure directly: a value-critic REINFORCE policy over-guards by roughly 3.5×3.5\times the optimum and a greedy-baseline variant by about 7×7\times (Table 1). Bradley–Terry preferences between rollout pairs remain informative even when both rollouts saturate, since the signal depends on which rollout is relatively better, not on the absolute reward level. PO/BT reaches comparable coverage at 1.3×1.3\times the optimum. At the end of training the encoder is frozen and treated as a fixed feature extractor for the rest of the paper; Figure 1 shows the late-phase policy holding a stable operating point — good coverage at low guard cost — rather than collapsing onto a single axis.

Training objective Mean cov Mean |S|/OPT|S|/\mathrm{OPT}
REINFORCE, learned-critic baseline 0.904 3.49
REINFORCE, greedy rollout baseline 1.000 7.00
PO / Bradley–Terry (ours) 0.978 1.30
Table 1: Training-objective comparison on the combined dev++test splits (greedy decode). Coverage is geometric polygon coverage; |S|/OPT|S|/\mathrm{OPT} is the mean approximation ratio. The REINFORCE rows are 3030-epoch runs evaluated under an earlier version of the evaluation protocol; they are reported as evidence for the choice of training objective, not as tuned, protocol-matched competitors. The PO/BT row is the policy used throughout the paper.
Refer to caption
Figure 1: PO/BT training dynamics over the four late checkpoints (epochs 110110, 114114, 160160, 200200), evaluated on a fixed held-out sample of 100100 polygons; per-epoch logging was not preserved, so early training is not shown (Section 8). Mean per-polygon coverage (left axis, zoomed to [0.95,1.0][0.95,1.0]) and |S|/OPT|S|/\mathrm{OPT} (right axis) of the greedy-decoded policy stay within a narrow band with no sign of collapse; the movement is non-monotone rather than a clean trend, which is all four checkpoints can establish.

4.2 Probing the representation: a single-shot per-vertex classifier

The policy learns to decrease the cardinality but leaves a feasibility tail (Section 6.1). We first verified that this tail is reachable: running a local-search editor (add/remove/swap) on the policy’s own greedy seed restores feasibility on every sub-feasibility in-distribution polygon while reducing the guard count (the LS on policy seed row of Table 3). We treat this as a sanity check that settles the diagnosis — the missing feasibility is a handful of local edits away from the seed, so it reflects decoder calibration rather than absent geometry — and that, as a by-product, yields a concrete target set SLSS^{\star}_{\mathrm{LS}}. The open question is then whether that refinement can be learned from the frozen encoder, geo-free. A learned iterative editor over the seed cannot: it does not close the tail in any coverage-preserving regime (Section 6.4), because error accumulation, stop-time miscalibration, and train/inference drift make a rollout-based post-processor unreliable here, contrary to fine-tuning arguments of the policy in [29]. We therefore probe what the policy has learned with a single-shot, rollout-free classifier that recovers the feasibility decision the decoder failed to make. Its inputs are geo-free — the frozen encoder embedding, the vertex coordinates, and the policy’s seed indicator — with no visibility object at inference. Having no rollout and no stop decision, a single pass has nothing to drift and nothing to miscalibrate. We call it the SetPredictor and present it primarily as a representation diagnostic; the threshold sweep in Section 6 makes the diagnostic concrete, but the operating-curve framing is a side-effect of the measurement, not a deployment recipe.

For polygon PP with vertex coordinates x1,,xnx_{1},\dots,x_{n}, the input feature for vertex ii is

zi=[hi;xi; 1{iS(πseed)}]d+3,z_{i}\,=\,[\,h_{i}\,;\,x_{i}\,;\,\mathbf{1}\{i\in S(\pi_{\mathrm{seed}})\}\,]\,\in\,\mathbb{R}^{d+3}, (7)

that is, the frozen dd-dimensional encoder embedding hih_{i} from the PO/BT policy, the 2D coordinates xix_{i}, and a single binary indicator for whether vertex ii is selected by the policy’s greedy-decoded seed πseed\pi_{\mathrm{seed}}.

The architecture is a small Transformer encoder (parameters ϕ\phi) over the nn vertex tokens with a sigmoid output head per token (sizes given in Section 5.2). The features ziz_{i} are linearly projected to width dd and passed through LL pre-norm self-attention blocks — each with HH-head self-attention at width dd, a GELU feed-forward sublayer of expansion factor ff, and residual connections — followed by a final layer normalization; we write viv_{i} for the resulting per-vertex hidden state. This viv_{i} is the SetPredictor’s own encoding of vertex ii, and is distinct from the frozen policy embedding hih_{i}, which enters only as part of the input feature ziz_{i} in Eq. (7). From the (v1,,vn)(v_{1},\dots,v_{n}) we build two context vectors by mean-pooling: a seed context cseed=1|S(πseed)|iS(πseed)vic_{\mathrm{seed}}=\tfrac{1}{|S(\pi_{\mathrm{seed}})|}\sum_{i\in S(\pi_{\mathrm{seed}})}v_{i}, averaged over the vertices the policy placed in its seed, and a global context cglob=1ni=1nvic_{\mathrm{glob}}=\tfrac{1}{n}\sum_{i=1}^{n}v_{i}, averaged over all vertices of the polygon. Both are single vectors shared across vertices. Each per-vertex prediction concatenates the vertex’s own viv_{i} with these two context vectors and passes the result through a 2-layer GELU MLP head (3dd13d\to d\to 1):

p^i=σ(MLP([vicseedcglob])),i=1,,n.\hat{p}_{i}\,=\,\sigma\!\big(\mathrm{MLP}([\,v_{i}\,\|\,c_{\mathrm{seed}}\,\|\,c_{\mathrm{glob}}\,])\big),\qquad i=1,\dots,n. (8)

The output p^i[0,1]\hat{p}_{i}\in[0,1] is interpreted as the probability that vertex ii should be in the final guard set. Thresholding at t(0,1)t\in(0,1) yields the guard set

S(t)={i{1,,n}:p^it}.S(t)\,=\,\{\,i\in\{1,\dots,n\}:\hat{p}_{i}\geq t\,\}. (9)

Objective

The probe is trained by supervised learning to predict membership in an LS-improved target set SLS(P)S^{\star}_{\mathrm{LS}}(P) — the policy’s seed refined by local search, constructed as described in Section 5.2 — so it is not itself a reinforcement learner. We supervise on this refinement rather than on the exact optimum for two reasons: guard membership at the optimum is non-unique — the constraint in Eq. (2) fixes cardinality, not which vertices are chosen (Section 6.4) — so an optimal set is an ill-posed per-vertex label, and the optimum is in any case unavailable at scale (Section 5.3); the LS endpoint, by contrast, is a well-defined, feasibility-restoring refinement of the policy’s own seed. With yi=𝟏{iSLS(P)}y_{i}=\mathbf{1}\{i\in S^{\star}_{\mathrm{LS}}(P)\}, a positive-class weight w+w_{+} that balances the empirical non-guard-to-guard ratio, and \mathcal{B} a mini-batch of polygons, the per-vertex weighted binary cross-entropy is

BCE(ϕ)=1P|V(P)|Pi=1|V(P)|[w+yilogp^i+(1yi)log(1p^i)].\mathcal{L}_{\mathrm{BCE}}(\phi)\,=\,-\frac{1}{\sum_{P\in\mathcal{B}}|V(P)|}\sum_{P\in\mathcal{B}}\sum_{i=1}^{|V(P)|}\big[\,w_{+}y_{i}\log\hat{p}_{i}+(1-y_{i})\log(1-\hat{p}_{i})\,\big]. (10)

Whether closing that tail is the encoder’s doing rather than the probe’s own capacity is settled not here but by the controls of Section 6.2: the no-encoder ablation and a zero-capacity linear probe.

Inference and the threshold knob

At inference we run the policy encoder once on the coordinates, greedy-decode the seed πseed\pi_{\mathrm{seed}}, form the features ziz_{i} of Eq. (7), run one SetPredictor forward pass to obtain (p^1,,p^n)(\hat{p}_{1},\dots,\hat{p}_{n}), and threshold to return S(t)S(t) (Eq. (9)). The policy seed, the local-search editor, and the discretized-visibility scoring are all training-time scaffolding used only to build the labels SLSS^{\star}_{\mathrm{LS}}; at inference the method is this single geo-free forward pass — no seed-refinement loop, no local search, no visibility object — so LS is part of how the probe is trained. The threshold tt is the single knob trading coverage for guard count, swept over t{0.20,0.25,0.30}t\in\{0.20,0.25,0.30\} in Section 6.2. Re-running the probe with its own output as the updated seed indicator changes nothing — it is empirically a fixed point after the first pass (Section 6.4) — so we use a single pass throughout.

5 Experiments

5.1 Dataset and partition

We draw polygons primarily from the Art Gallery Problem with Vertex Guards (AGPVG) instance library of Couto, de Rezende, and de Souza [9, 10], which collects several polygon families — random orthogonal (including large-area fat and small-area min variants), random simple (randsimple), random von Koch (randvon), complete von Koch (vonkoch), and boundary-simple (simple) — over vertex counts nn ranging from 88 to 2,5002{,}500. The library distributes vertex-guard optimal solutions for its instances, which we use directly as OPT\mathrm{OPT}.

We supplement the library with 313313 additional random simple polygons (random class, n{20,40,,600}n\in\{20,40,\dots,600\}, 3030 instances per size) to increase training density for random simple polygons in the in-distribution range; these share the same rational-coordinate format, and their vertex-guard optima were computed by solving the vertex-guard set-cover integer program. The splits we work with are listed in Table 2.111The 70/3070/30 dev//test partition uses a seeded random shuffle (seed 12341234) rather than an alphabetical-name split due to instance naming

Split # polygons nn range Use
train 8,867 8–198 Pointer + SetPredictor training
dev 840 16–198 SetPredictor checkpoint selection
test 362 8–192 Held-out evaluation (in-distribution)
ood 2,081 8–1,000 Out-of-distribution evaluation
ood-large 285 600–2,250 Extreme-OOD evaluation
Table 2: Dataset partition. The dev and test splits share the same source library files, partitioned 70/3070/30 by seeded random shuffle (seed 12341234): dev is used for checkpoint selection and test is the in-distribution held-out evaluation. The ood split is a larger held-out set spanning the training size range and extending well beyond it, to n=1000n=1000 (about five times the training maximum); its genuinely larger-than-training polygons (n>198n>198) are analyzed separately in Section 6.3. The ood-large split is used for an extreme-OOD evaluation (Table 8).

5.2 Training setup

The PO/BT policy is the existing lstm_bt checkpoint: trained once from a single seed (12341234) for 200200 epochs, with R=8R=8 rollouts per polygon, Bradley–Terry temperature α=0.05\alpha=0.05, and reward weights λ=1.0\lambda=1.0 and ρ=3.0\rho=3.0 at the gate τ=0.99\tau=0.99 (Eq. (3)); all PO coverage rewards are computed from discrete-visibility sampling (M=500M=500), after which the encoder is frozen.

Inputs: a polygon’s vertices are consumed by the LSTM in the source-library file order — no canonical start vertex and no orientation (clockwise/counter-clockwise) are imposed — and the coordinates are min–max normalized per polygon to [0,1]2[0,1]^{2} before the encoder. The normalization makes the policy invariant to translation and to axis-aligned rescaling by construction; it is not invariant to rotation or to a cyclic re-indexing of the vertices, whose empirical effect we quantify in Section 8.

Targets: for each training polygon we obtain SLS(P)S^{\star}_{\mathrm{LS}}(P) by running local search on the policy’s seed under a coverage gate τ=0.99\tau=0.99 against a discrete-visibility reference (Section 4.1); exact coverage-area computation is reserved for evaluation. These targets are not produced by a policy-independent classical solver: they come from a best-improvement editor (add/remove/swap) initialized at the policy’s own greedy-decoded seed and scored on discretized visibility. We refer to this policy-seeded refinement as “LS” throughout; it is the same edit family whose learned counterpart we analyze in Section 6.4, here run to its best-improvement optimum as an oracle teacher rather than learned.

Sizes: the model dimensions of Section 4.2 take the following values:

working width dd 128128
self-attention blocks LL 33
attention heads HH 88
FFN expansion factor ff 22
policy parameters  364k{\approx}\,364\text{k}
SetPredictor parameters 464,001464{,}001

Optimization: weighted BCE (Eq. (10)) with positive-class weight w+5.66w_{+}\approx 5.66 (the empirical non-guard-to-guard ratio on train), 6060 epochs, batch size 3232, learning rate 3×1043\times 10^{-4}, AdamW [25]. Both the headline probe and the no-encoder ablation are four independent runs with seeds {1234,11,22,33}\{1234,11,22,33\} (reported as mean ±\pm std); all figures display seed 12341234. Checkpoints are selected on dev; test, ood, and ood-large are touched only after training and threshold selection are complete.

5.3 Evaluation protocol and metrics

For each polygon we greedy-decode the policy’s EOS\mathrm{EOS}-truncated seed and evaluate it with CGAL exact polygon visibility [31]; separately, we run one SetPredictor pass, threshold at tt to obtain S(t)S(t), and evaluate S(t)S(t) with CGAL. Let NN be the partition size. We report mean coverage 1NCovP(S)\frac{1}{N}\sum\mathrm{Cov}_{P}(S), the normalized guard count |S|/n|S|/n, and the approximation ratio |S|/OPT|S|/\mathrm{OPT} against the vertex-guard optimum. Because the mean hides the tail, we also report the per-polygon counts #{CovP(S)c}\#\{\mathrm{Cov}_{P}(S)\geq c\} for c{0.95,0.99,0.999,1}c\in\{0.95,0.99,0.999,1\} and the worst case minPCovP(S)\min_{P}\mathrm{Cov}_{P}(S); we call CovP(S)0.95\mathrm{Cov}_{P}(S)\geq 0.95 the feasibility threshold and report the feasibility rate #{CovP(S)0.95}/N\#\{\mathrm{Cov}_{P}(S)\geq 0.95\}/N with Wilson 95%95\% confidence intervals. Where we compare the policy’s and the probe’s feasibility rates on the same polygons, we test the paired per-polygon outcomes with an exact McNemar test. The reporting threshold cc is distinct from the training-time gate τ=0.99\tau=0.99 of Eq. (3). As noted in Section 5.1, OPT\mathrm{OPT} values are taken from the vertex-guard optimal solutions distributed with the AGPVG library (for library instances) or computed via ILP (for the random class). For 7979 of the ood-large polygons (all with n800n\geq 800) no optimum is available — most are library instances whose distributed solutions do not cover them (they are open at the source), and for the remainder our ILP did not terminate within budget. Restricting |S|/OPT|S|/\mathrm{OPT} to the 206206 solved instances would bias the ratio toward the easier 600600700700-vertex polygons, so on ood-large we report coverage and the normalized guard count |S|/n|S|/n over all 285285 polygons but no approximation ratio.

Computational cost

All experiments were run on a single workstation with an AMD Ryzen 9 3900X CPU, 64 GB RAM, and an NVIDIA GeForce RTX 2080 SUPER GPU (8 GB VRAM). Training the SetPredictor probe is light: about 55 s per epoch, 5\approx 5 minutes for a 6060-epoch run on the GPU above (the four-seed ensemble, 20\approx 20 minutes); the frozen PO/BT policy is reused rather than retrained. The geometric preprocessing — the exact per-vertex CGAL visibility polygons and the M=500M=500 discretized-visibility matrix — is computed once per polygon and cached for the \sim9.99.9k library polygons, so it is not repeated across rollouts, edits, or seeds. At inference the method is a single geo-free forward pass: the full policy-encode-plus-probe pipeline runs in 25\approx 25 ms at n=1000n=1000 and 57\approx 57 ms at n=2000n=2000 on a GPU (89\approx 89 and 264\approx 264 ms on CPU), the probe itself contributing only 10\approx 10 ms at n=2000n=2000. Exact CGAL coverage is used solely for evaluation and is the dominant offline cost, scaling with the guard-set size.

6 Results

The results follow the hypothesis: the policy places guards but leaves a tail (Section 6.1), reading the frozen encoder closes it (Section 6.2), the effect generalizes out of distribution (Section 6.3), and a single pass suffices because the probe is a fixed point under iteration (Section 6.4).

6.1 The policy places guards

Method Mean cov #{Cov0.95}/N\#\{\mathrm{Cov}\geq 0.95\}/N Mean |S|/n|S|/n Mean |S|/OPT|S|/\mathrm{OPT}
Classical baseline (non-learned)
     Greedy 0.9994 362/362 0.1524 1.0092
Learned policy / probe
     Pretrained pointer (seed) 0.9689 293/362 (0.766,0.847) 0.1664 1.0885
     SetPredictor full (t=0.20t=0.20) 0.9985±\pm 0.0010 361.5±0.9361.5\pm 0.9/362 0.3888 ±\pm 0.0603 2.5560 ±\pm 0.4027
     SetPredictor, no-encoder (t=0.20t=0.20) 0.9972 ±\pm 0.0013 361.8±0.4361.8\pm 0.4/362 0.4781 ±\pm 0.0504 3.1093 ±\pm 0.3350
Oracle editor (SetPredictor target, policy-seeded)
     LS on policy seed 0.9948 362/362 0.1369 0.8854
Table 3: Held-out evaluation on test (362362 polygons). Classical greedy is the only non-learned anchor. The seed row is the PO/BT policy decoded greedily; LS on policy seed is its local-search refinement — the SetPredictor’s supervised target, not a classical baseline (Section 5.2). SetPredictor full and no-encoder are the headline probe and its zeroed-embedding ablation, each mean ±\pm std over four seeds {1234,11,22,33}\{1234,11,22,33\} at t=0.20t=0.20 (Section 6.2). Both probes saturate the 0.950.95 gate (361/362\geq 361/362); the encoder’s effect surfaces at the stricter gates of Table 4 and in the 20%\approx\!20\% higher guard cost the no-encoder probe pays (|S|/OPT|S|/\mathrm{OPT} 3.113.11 versus 2.562.56). Wilson 95%95\% CI shown only where the proportion is informative.
|S|/OPT<1|S|/\mathrm{OPT}<1 for the oracle editor because it halts at a disc-vis coverage gate (τ=0.99\tau=0.99) and slightly undercovers (mean exact coverage 0.9950.995), so it uses fewer guards than the full-coverage optimum. This is an artifact of the training reward, not a loose OPT\mathrm{OPT}; the |S|/n|S|/n column is free of this caveat.

The geo-free policy on its own (Table 3, seed row) places guards at near-classical cardinality — its mean |S|/n|S|/n (0.1660.166) is close to classical greedy (0.1520.152), at |S|/OPT1.09|S|/\mathrm{OPT}\approx 1.09 — and clears the feasibility line on 293293 of the 362362 test polygons (81%81\%, Wilson 95%95\% CI: 0.7660.766, 0.8470.847). The LS oracle (bottom row) achieves 362/362362/362 feasibility at lower cardinality (|S|/n=0.137|S|/n=0.137, |S|/OPT=0.885|S|/\mathrm{OPT}=0.885), but is not geo-free: it reads discretized visibility at every add/remove/swap step and serves here only as the probe’s training target, not as a geo-free competitor. The results show that vertex-guard placement is learnable, to a degree, by a method that never reads an explicit visibility object at inference. The remaining 6969 polygons (19%19\%) fall below the feasibility line under the policy alone; this tail is the paper’s central motivating observation (the distribution is Section 6.2), and the method gives no feasibility guarantee on its own. Figure 2 shows the seed, the seed plus probe at t=0.20t=0.20, and the optimum side by side for an in-distribution and an out-of-distribution polygon.

Refer to caption
Figure 2: Two polygons drawn with the policy’s seed (left column), the policy plus the SetPredictor probe at t=0.20t=0.20 (middle column), and the optimum (right column). Vertices selected as guards are marked. Row 1 is an in-distribution polygon where the policy already covers most of the region. Row 2 is an out-of-distribution polygon (n=350n=350, about 1.8×1.8\times the largest training polygon) where the policy’s seed falls below the feasibility line (CovP()=0.942\mathrm{Cov}_{P}(\cdot)=0.942) and the probe lifts it back above (CovP()=0.996\mathrm{Cov}_{P}(\cdot)=0.996) at a guard cost typical for this split (|S|/OPT2.4|S|/\mathrm{OPT}\approx 2.4; cf. Table 7).

6.2 Reading the encoder closes the tail

Method #{Cov=1}\#\{\mathrm{Cov}=1\} #{Cov0.999}\#\{\mathrm{Cov}\geq 0.999\} #{Cov0.99}\#\{\mathrm{Cov}\geq 0.99\} #{Cov0.95}\#\{\mathrm{Cov}\geq 0.95\} min Cov\mathrm{Cov}
Pretrained pointer (seed) 3 15 97 293/362 0.830
SetPredictor t=0.20t=0.20 (full) 240 310 358 362/362 0.977
SetPredictor t=0.20t=0.20, no-encoder 120 163 306 362/362 0.952
SetPredictor t=0.25t=0.25 (full) 194 280 356 362/362 0.962
SetPredictor t=0.30t=0.30 (full) 151 219 350 362/362 0.952
Table 4: Per-polygon coverage distribution on test (362362 polygons). Counts are polygons reaching each coverage threshold; min CovP(S)\mathrm{Cov}_{P}(S) is the worst polygon. The policy has a long left tail; the SetPredictor shifts the whole distribution right. This table exposes the encoder’s contribution: where the 0.950.95 gate saturates, the stricter 0.990.99/0.9990.999 gates and minCovP(S)\min\mathrm{Cov}_{P}(S) separate the full probe from the no-encoder ablation — at t=0.20t=0.20, 44 polygons below 0.990.99 versus 5656, and a worst polygon lifted from 0.9520.952 to 0.9770.977. All rows are the displayed seed (12341234); Table 3 gives the four-seed mean±\pmstd.

The policy’s mean coverage hides a bimodal distribution (Table 4): it reaches full coverage on only a handful of polygons and trails a long left tail down to a worst polygon at 0.8300.830, while the SetPredictor at t=0.20t=0.20 eliminates that tail and lifts the bulk above 0.990.99. On the displayed seed the probe clears all 6969 of the policy’s sub-feasibility polygons and introduces no new failures (69069\to 0 below 0.950.95; paired exact McNemar test, p3×1021p\approx 3\times 10^{-21}); the same paired test is applied at OOD scale in Section 6.3. The shift holds under the strict full-coverage requirement that defines the problem (CovP(S)=1\mathrm{Cov}_{P}(S)=1): the probe covers 240240 of the 362362 polygons exactly against the policy’s 33 (Table 4), so the gain is not an artifact of reading feasibility at the 0.950.95 gate.

The next question we ask is whether closing that tail is the encoder’s doing or coordinates alone suffice. We isolate the encoder’s contribution by retraining the probe with the frozen embedding hih_{i} masked to zero on every forward pass — architecture, parameter count, optimizer, and training schedule identical (Section 5.2). The 0.950.95 feasibility gate does not reveal the encoder’s contribution, as both probes saturate there (362/362362/362, Table 3) and the headline mean coverage is likewise nearly identical. The encoder’s contribution lives in the tail. At the stricter 0.990.99 gate of Table 4, on the displayed seed the full probe leaves 44 polygons below against 5656 for the no-encoder ablation, a fourteen-fold difference, and lifts the worst polygon from 0.9520.952 to 0.9770.977 (across the four seeds the mean count below 0.990.99 is 11\approx 11 for the full probe and 29\approx 29 for the no-encoder, of which the displayed seed’s 5656 is the worst — Table 5). The gap becomes qualitative out of distribution (Section 6.3). This is the direct measurement behind C1 — the frozen encoder carries geometric structure that raw coordinates do not reproduce, even with the decoder’s full parameter budget.

The no-encoder ablation removes the encoder but keeps the policy’s seed indicator among the probe’s inputs, so the recovered feasibility could in principle be credited to the decoder’s guard set rather than to the encoder. We therefore complete the 2×22\times 2 design, masking the seed indicator as well (Table 5, four-seed mean ±\pm std at t=0.20t=0.20). The two axes must be read together, because at a fixed threshold the four conditions trade coverage against guard count differently. Masking the seed alone barely moves the probe: the encoder-only probe nearly matches the full probe both in the tail (17.2±2.417.2\pm 2.4 versus 11.0±8.711.0\pm 8.7 polygons below 0.990.99) and in cost (|S|/OPT|S|/\mathrm{OPT} 2.422.42 versus 2.562.56). Masking the encoder alone is markedly worse on the tail (29.0±18.029.0\pm 18.0 below 0.990.99 at |S|/OPT|S|/\mathrm{OPT} 3.113.11). The coords-only probe (neither input) appears to have the smallest tail of all, but this is misleading: its threshold barely functions — it selects 0.720.72, 0.700.70, and 0.670.67 of all vertices at t=0.20,0.25,0.30t=0.20,0.25,0.30 — so it holds coverage by flooding guards rather than selecting them, never entering the guard-count regime the other probes operate in (|S|/OPT=4.64|S|/\mathrm{OPT}=4.64, against 2.42.42.62.6). Reading the axes together, the seed contributes little, whereas the encoder is what makes economical feasibility reachable at all: it is the representation, not the decoder’s seed, that lets the probe place few guards and still cover.

Probe inputs #{Cov<0.99}\#\{\mathrm{Cov}<0.99\} minCov\min\mathrm{Cov} |S|/n|S|/n |S|/OPT|S|/\mathrm{OPT}
SetPredictor full 11.0±8.711.0\pm 8.7 0.951 ±\pm 0.029 0.39 ±\pm 0.06 2.56 ±\pm 0.40
     no-seed (encoder only) 17.2±2.417.2\pm 2.4 0.918 ±\pm 0.029 0.37 ±\pm 0.01 2.42 ±\pm 0.07
     no-encoder (seed only) 29.0±18.029.0\pm 18.0 0.947 ±\pm 0.022 0.48 ±\pm 0.05 3.11 ±\pm 0.34
     coords-only (neither) 0.8±0.80.8\pm 0.8 0.962 ±\pm 0.044 0.72 ±\pm 0.01 4.64 ±\pm 0.06
Table 5: Encoder–seed ablation on test (362362 polygons), all probes at t=0.20t=0.20, four-seed mean ±\pm std over {1234,11,22,33}\{1234,11,22,33\}. The four rows are the 2×22\times 2 on/off combinations of the two probe inputs — the frozen encoder embedding and the policy’s seed indicator; an absent input is masked to zeros, leaving architecture and parameter count unchanged. The two axes must be read together — the coverage tail (#{Cov<0.99}\#\{\mathrm{Cov}<0.99\}, minCov\min\mathrm{Cov}) against guard cost (|S|/n|S|/n, |S|/OPT|S|/\mathrm{OPT}) — because the same threshold t=0.20t=0.20 does not produce the same operating regime across rows: a probe that selects nearly all vertices will show a short tail simply by over-guarding, not by genuinely learning to place guards. The encoder-only probe nearly matches the full probe on both; the seed-only probe (the no-encoder row of Table 4) widens the tail at higher cost. The coords-only probe is degenerate: its threshold barely functions — it selects 0.720.72, 0.700.70, 0.670.67 of all vertices at t=0.20,0.25,0.30t=0.20,0.25,0.30 — so it holds coverage by flooding guards rather than selecting them, and its small #{Cov<0.99}\#\{\mathrm{Cov}<0.99\} is an artifact of over-guarding (|S|/OPT=4.64|S|/\mathrm{OPT}=4.64), not selection quality. Only the encoder-bearing probes reach the economical regime (|S|/OPT2.4|S|/\mathrm{OPT}\approx 2.42.62.6); without the encoder the probe is stuck at 3.1×3.1\times (seed only) or 4.6×4.6\times (neither). The encoder is what makes economical feasibility reachable.

Since the SetPredictor is itself a small Transformer, one might worry that it computes the geometry and the encoder merely passes coordinates through. To rule this out we strip the probe of all capacity: a plain linear classifier (without any hidden layers or attention) reading only the frozen per-vertex features, asked to separate guard from non-guard vertices. It already reaches ROC-AUC0.84\mathrm{ROC\text{-}AUC}\approx 0.84, a measure of how reliably the rule ranks a guard above a non-guard (0.50.5 is chance, 1.01.0 perfect).222L2L_{2}-regularized logistic regression under 55-fold polygon-grouped cross-validation over 200200 polygons / 12,42412{,}424 vertices: ROC-AUC=0.843±0.009\mathrm{ROC\text{-}AUC}=0.843\pm 0.009 and PR-AUC=0.582±0.027\mathrm{PR\text{-}AUC}=0.582\pm 0.027 (mean ±\pm std across folds) against a 0.160.16 positive-class prior. A linear readout cannot be credited with computing geometry itself [1, 16], so the guard-relevant structure must already be linearly accessible in the frozen embedding — the encoder, not the probe, carries it. Scored by such a linear rule, guard vertices pile up at higher scores than non-guards, with only modest overlap, as visible in Figure 3.

Refer to caption
Figure 3: Guard vertices are linearly separable in the frozen encoder features. A one-dimensional linear rule (LDA) over the same 200200-polygon / 12,42412{,}424-vertex probe set scores each vertex; the histograms show that score for guard and non-guard vertices separately. The rule is fit and scored out-of-fold (GroupKFold over polygons — no vertex is scored by a rule that saw its own polygon), so the separation reflects embedding structure rather than memorization. Guards score higher; the shaded band is the overlap the rule confuses. The separation corresponds to ROC-AUC0.84\mathrm{ROC\text{-}AUC}\approx 0.84, matching the logistic probe in the text.

Closing the tail has a price, with the threshold tt as the knob: the approximation ratio rises as tt falls (Table 6), reaching 2.6×\approx\!2.6\times optimum at the headline threshold against 1.1\approx 1.1 for the policy alone. Figure 4 gives the full picture in and out of distribution: the coverage CDF sits to the right of the policy at every threshold, while the |S|/n|S|/n and |S|/OPT|S|/\mathrm{OPT} box plots show the matching cost. We report the trade-off rather than fix one operating point.

Threshold tt Mean Cov\mathrm{Cov} Mean |S|/n|S|/n Mean |S|/OPT|S|/\mathrm{OPT} #{Cov0.95}/N\#\{\mathrm{Cov}\geq 0.95\}/N
0.20 0.9993 0.4376 2.8972 362/362
0.25 0.9988 0.3802 2.5082 362/362
0.30 0.9979 0.3353 2.2041 362/362
Table 6: Cost of feasibility on test (362362 polygons), displayed seed (12341234). The four-seed mean |S|/OPT|S|/\mathrm{OPT} is 2.02±0.212.02\pm 0.21 at t=0.30t=0.30 and 2.56±0.402.56\pm 0.40 at t=0.20t=0.20.
Refer to caption
Figure 4: Distributional view across all three regimes: test (top, 362362 polygons), ood (middle, 20812081), and ood-large (bottom, 285285, nn up to 22502250). Left: per-polygon coverage as a complementary stair-step CDF; the dashed line marks the 0.950.95 feasibility gate. Center: box plots of |S|/n|S|/n for the policy seed and the three probe thresholds. Right: the approximation ratio |S|/OPT|S|/\mathrm{OPT} for test and ood only; omitted for ood-large, where OPT\mathrm{OPT} is unavailable for the larger polygons (Section 5.3). Boxes show the displayed seed (12341234); four-seed mean±\pmstd is in Tables 3 and 7.

6.3 Out-of-distribution generalization

Method Mean cov #{Cov0.95}/N\#\{\mathrm{Cov}\geq 0.95\}/N Mean |S|/n|S|/n Mean |S|/OPT|S|/\mathrm{OPT}
Pretrained pointer (seed) 0.9567 1778/2081 (0.839,0.869) 0.1807 1.1488
SetPredictor full (t=0.20t=0.20) 0.9946±\pm 0.0020 2064.2±12.22064.2\pm 12.2/2081 0.3617 ±\pm 0.0447 2.3341 ±\pm 0.2951
SetPredictor, no-encoder (t=0.20t=0.20) 0.9849 ±\pm 0.0055 2037.2±22.62037.2\pm 22.6/2081 0.3667 ±\pm 0.0550 2.3583 ±\pm 0.3646
Table 7: Out-of-distribution evaluation on ood (20812081 polygons, nn up to 10001000). Rows as in Table 3: the policy seed, the full probe, and the no-encoder ablation, both probes the four-seed mean ±\pm std at t=0.20t=0.20.

The next test of the representation is the ood split, a larger held-out set that spans the training size range and extends well beyond it, reaching n=1000n=1000 — about five times the largest training polygon. Table 7 reports it. The policy alone drops to mean coverage 0.9570.957, with 303/2081303/2081 polygons below the feasibility line (85%85\% feasible). The probe at t=0.20t=0.20 (four-seed mean) cuts that to 16.8±12.216.8\pm 12.2 (99%99\% feasible) at mean coverage 0.9950.995 — roughly an order-of-magnitude reduction in failures. The improvement is significant for every probe seed individually (per-seed failure counts 33, 88, 2222, 3434; exact McNemar test on the paired per-polygon outcomes, p<1.2×1075p<1.2\times 10^{-75} in all four cases), and the per-seed Wilson 95%95\% intervals for the probe’s feasibility rate all lie above 0.9770.977. The no-encoder ablation leaves 44±2344\pm 23 below feasibility (four-seed mean) at mean coverage 0.9850.985 — about 2.6×2.6\times the full probe’s failures (Table 7), so out of distribution the encoder’s contribution is qualitative, not the saturated tie it is in distribution. This is the evidence behind C2: a small classifier over the frozen embeddings recovers feasibility across this held-out set. The gain is not an artifact of the in-distribution-size majority of ood: on the genuinely larger-than-training polygons alone (n>198n>198, 885885 instances) the policy is feasible on 85%85\% and the probe on 98.7%98.7\% across four seeds (100%100\% on the displayed seed). The reduction is robust across seeds, though the residual failure count is seed-dependent (the per-seed counts above), so at-scale generalization is partially seed-dependent. Figure 4 (middle and bottom rows) shows the matching coverage and cost distributions for ood and ood-large.

Method Mean Cov\mathrm{Cov} minCov\min\mathrm{Cov} #{Cov0.95}/N\#\{\mathrm{Cov}\geq 0.95\}/N Mean |S|/n|S|/n
Pretrained pointer (seed) 0.8678 0.038 210/285 0.1849
SetPredictor full (t=0.20t=0.20) 0.9630±\pm 0.0209 0.612 ±\pm 0.286 253.5±19.9253.5\pm 19.9/285 0.2951 ±\pm 0.0579
SetPredictor, no-encoder (t=0.20t=0.20) 0.9295 ±\pm 0.0196 0.091 ±\pm 0.092 251.0±13.8251.0\pm 13.8/285 0.3053 ±\pm 0.0477
Table 8: Extreme out-of-distribution evaluation on ood-large (285285 polygons, nn from 600600 to 22502250), coverage scored by exact CGAL at t=0.20t=0.20 over all 285285. The approximation ratio is omitted because OPT\mathrm{OPT} is unavailable for 7979 polygons (n800n\geq 800; Section 5.3). The full-probe and no-encoder rows are four-seed mean ±\pm std; their minCov\min\mathrm{Cov} is the mean ±\pm std of the four per-seed worst-polygon coverages.

Extreme out-of-distribution

On the ood-large split (nn from 600600 to 22502250, up to about eleven times the largest training polygon), the worst-case coverage is the more informative metric, as the mean can look reasonable while the tail collapses (Table 8). The pretrained seed leaves its worst polygon at coverage 0.0380.038 with 75/28575/285 below feasibility, and the no-encoder ablation barely moves it: its worst-polygon coverage averages only 0.09±0.090.09\pm 0.09 across the four seeds (three of the four still pinned at 0.0380.038), so it trails the full probe on the mean (0.9300.930 versus 0.9630.963) and collapses on the tail. The full probe lifts the worst polygon for every seed: across the four seeds the worst-polygon coverage at t=0.20t=0.20 averages 0.61±0.290.61\pm 0.29 (against the no-encoder’s 0.09±0.090.09\pm 0.09), and on the displayed seed reaches 0.9510.951 with nothing left below feasibility. The seed dependence is genuine — the four-seed feasibility count is 253.5±19.9253.5\pm 19.9 of 285285, with the per-seed count below feasibility ranging from 0 (the displayed seed) to 5555. At t=0.20t=0.20 the probe uses |S|/n0.30|S|/n\approx 0.30 guards on this split, the cost of restoring feasibility this far beyond the training range.

6.4 Alternatives to the probe: decode search, editors, and the fixed point

Decode-time search does not close the tail

The simplest alternative to the probe is to decode the policy better rather than read its encoder. We test this directly (Table 9): for each test polygon we draw K=32K=32 stochastic rollouts from the frozen policy and select among them geo-free, by length-normalized log-likelihood, with no visibility oracle. The result is statistically indistinguishable from greedy — 294294 versus 293293 of 362362 feasible, 266266 versus 265265 below 0.990.99, identical |S|/OPT1.09|S|/\mathrm{OPT}\approx 1.09 — because the policy’s maximum-likelihood decode already is the greedy seed: likelihood-ranked search never prefers a more-covering set, and occasionally prefers a less-covering one (its worst polygon drops to 0.5570.557). Letting an oracle pick the highest-coverage sample of the 3232 — classical active search, no longer geo-free — raises feasibility to 333/362333/362 but still leaves 199199 below 0.990.99, far short of the probe (362/362362/362 feasible, 4\approx 4 below 0.990.99 on the displayed seed). The tail is thus not a decoding artifact that more search removes: closing it requires reading the encoder, and the probe pays for that in guards (|S|/OPT2.6|S|/\mathrm{OPT}\approx 2.6) rather than in search.

Decode of the frozen policy #{Cov0.95}/N\#\{\mathrm{Cov}\geq 0.95\}/N #{Cov<0.99}\#\{\mathrm{Cov}<0.99\} Mean Cov\mathrm{Cov} minCov\min\mathrm{Cov} |S|/n|S|/n |S|/OPT|S|/\mathrm{OPT}
greedy (seed) 293/362 265 0.969 0.830 0.166 1.09
best-of-3232, likelihood (geo-free) 294/362 266 0.967 0.557 0.166 1.09
best-of-3232, coverage oracle (not geo-free) 333/362 199 0.981 0.845 0.171 1.12
Table 9: Decode-time search on test (362362 polygons), all reading the same frozen PO/BT policy. greedy is the seed of Table 3. best-of-3232, likelihood draws K=32K=32 stochastic rollouts and keeps the one with the highest length-normalized log-likelihood — a geo-free decode search with no visibility oracle. best-of-3232, coverage oracle instead keeps the highest-coverage sample (tie-break: fewer guards); it consults exact coverage to choose and is therefore not geo-free, reported only as an upper bound on what the policy’s own samples contain. Greedy is included in each candidate pool, so neither selection can underperform it on its own ranking metric. Geo-free search matches greedy and does not close the tail; even oracle-assisted search falls well short of the SetPredictor (Table 4).

Before settling on the single-shot probe we tried to improve the seed with a learned iterative editor applying add/remove/stop\textsc{add}/\textsc{remove}/\textsc{stop} edits. Three architectures of increasing capacity — including one with topology features and an auxiliary visibility-prediction loss, and one with DAgger refresh [30] — failed to improve over the seed in any coverage-preserving regime. The best variant cut |S||S| by about 24%24\% but lost coverage, reaching only a 0.270.27 recovery rate (the fraction of its edits that match or beat the local-search reference on both coverage and guard count) — well below replacement quality. Three structural reasons recur: errors accumulate because the editor sees clean trajectories in training but its own predictions at inference; a single {add,remove,stop}×{1,,n}\{\textsc{add},\textsc{remove},\textsc{stop}\}\times\{1,\dots,n\} head couples selection with termination, miscalibrating stop; and the train/inference state distributions diverge in ways DAgger refresh did not close.

The same teacher-forcing cascade also defeats a supervised pointer network — which we trained directly on AGPVG solutions before adopting the reinforcement recipe. Cloning a guard sequence is ill-defined for a second reason beyond the cascade: a polygon has many optimal and near-optimal guard sets (the constraint CovP(S)=1\mathrm{Cov}_{P}(S)=1 in Eq. (2) fixes the cardinality, not the membership), so any single target sequence is an arbitrary choice among equivalent solutions, and an autoregressive loss penalizes the model for proposing a different but equally valid one. Forced onto sequences it would not itself generate, the recurrent decoder then drifts into hidden states unseen at training and collapses (mean coverage 0.50\approx 0.50 at roughly 3×3\times the optimum) — the brittleness of supervised cloning anticipated in Section 1. Preference optimization sidesteps both problems: it ranks the policy’s own rollouts by reward rather than matching a designated target, so non-unique optima are not in conflict. The single-shot probe likewise has none of these failure modes — no rollout, no stop token, and a per-vertex membership label rather than an ordered sequence, trained and used in the same regime.

A clarification this invites: the probe’s supervised targets SLSS^{\star}_{\mathrm{LS}} are themselves the output of this same edit family run to its best-improvement optimum — an oracle editor seeded by the policy. So the contrast is sharper than “editors fail, the probe works”: the learned editor fails to reproduce the oracle’s edits autoregressively, whereas a single-shot probe trained on the oracle’s endpoints recovers them in one pass. That the probe succeeds is therefore not, by itself, evidence about the encoder — it could merely be distilling the oracle. What attributes the success to the representation is the encoder–seed ablation of Section 6.2 (Table 5), run with the LS target held fixed: masking the seed indicator barely moves the probe, masking the frozen embedding markedly widens the tail, and removing both leaves a probe that stays feasible only by flooding guards (|S|/OPT4.6|S|/\mathrm{OPT}\approx 4.6); a zero-capacity linear probe over the same embedding already separates guards (ROC-AUC 0.8430.843). Were the probe merely distilling the oracle, which inputs it reads would not matter — yet it is the encoder, not the decoder’s seed, that makes the targets predictable from coordinates alone.

Re-running the probe with its own output as the updated seed indicator does not change the prediction in any meaningful way: across K{1,2,3,5}K\in\{1,2,3,5\} the three metrics barely move (Figure 5). Almost all of the movement is a single settle between K=1K=1 and K=2K=2 at the highest threshold (t=0.8t=0.8) — at most  0.02{\approx}\,0.02 in |S|/OPT|S|/\mathrm{OPT} (0.011\approx 0.011 in coverage, 0.004\approx 0.004 in |S|/n|S|/n); beyond K=2K=2 nothing changes by more than  8.5×103{\approx}\,8.5\times 10^{-3} (in |S|/OPT|S|/\mathrm{OPT}; coverage and |S|/n|S|/n stay under 2.5×1032.5\times 10^{-3}). This is the fixed-point property (C4): with the seed indicator already among its inputs, a single pass suffices.

Refer to caption
Figure 5: Iterative inference is a fixed point across all three metrics, on the combined dev++test pool (12241224 polygons). Each panel shows mean coverage (left), |S|/n|S|/n (centre), and |S|/OPT|S|/\mathrm{OPT} (right) versus inference passes K{1,2,3,5}K\in\{1,2,3,5\} for six thresholds t{0.5,0.6,0.65,0.7,0.75,0.8}t\in\{0.5,0.6,0.65,0.7,0.75,0.8\} — a higher range than the headline thresholds, where the fixed-point property is most stringently tested near the decision boundary. The metrics barely move (bounds in Section 6.4); almost all of the change is a single settle between K=1K=1 and K=2K=2 at t=0.8t=0.8.

7 Discussion

The central measurement in this paper is what happens when we read the encoder rather than the decoder. Reading only the frozen embeddings, the vertex coordinates, and the policy’s seed indicator — no visibility matrix, no CGAL output, no per-vertex area feature — the SetPredictor closes the in-distribution feasibility tail on the displayed seed (Table 3) and compresses the out-of-distribution one by roughly an order of magnitude across four probe seeds (Section 6.3). Two readings are consistent with this. Either the encoder learned enough geometric structure that a small classifier over its output recovers feasibility on its own, or the decoder’s EOS\mathrm{EOS} calibration was the bottleneck and the policy had already identified the right vertices but stopped too early. Both point to the same headline conclusion — the reinforcement method internalized a representation that holds more of the relevant geometry than its decoder expressed — and the encoder–seed ablation (Table 5) adjudicates between them in the encoder’s favour: masking the decoder’s expressed seed barely changes the probe, whereas masking the encoder widens the tail or forces wholesale over-guarding, so the carried signal is the encoder’s geometry rather than the decoder’s stopping behaviour. Three measurements pin this down (Section 6.2). The no-encoder ablation — masking the embedding to zeros with everything else held constant — leaves a probe that cannot match the full probe’s coverage at matched cardinality on either split, separating most clearly at the 0.990.99 gate of Table 4. The complementary no-seed control shows the encoder alone nearly recovers the full probe, so the seed is not what carries the signal. And the linear probe (ROC-AUC 0.843±0.0090.843\pm 0.009) shows the guard-relevant structure is linearly accessible in the frozen embedding, with Figure 3 as the matching visualization.

Recovering feasibility has a price, and we report it rather than disguise it. The probe’s guard sets grow to roughly two to three times the optimum as the threshold is lowered (Section 6.2), and the |S|/OPT|S|/\mathrm{OPT} box plots of Figure 4 show that the whole distribution widens, not just its mean. A deployment would pick one operating point on this curve; we leave it as a knob because the right point depends on how costly an extra guard is relative to an uncovered region.

Note that we are not claiming a better AGP solver. Classical greedy reaches mean coverage 0.9990.999 at |S|/n=0.152|S|/n=0.152 and |S|/OPT=1.009|S|/\mathrm{OPT}=1.009 on test (Table 3); the probe at t=0.20t=0.20 reaches mean coverage 0.998\geq 0.998 at |S|/OPT2.6|S|/\mathrm{OPT}\approx 2.6 across four seeds. A greedy-then-local-search pipeline — the standard high-quality classical recipe — would be stronger still on cardinality; the LS on policy seed row of Table 3, which refines a guard set with the same add/remove/swap editor to 362/362362/362 feasibility at |S|/OPT=0.885|S|/\mathrm{OPT}=0.885, already bounds what such refinement achieves on this data. The classical anchors win on cardinality at preserved coverage, and the policy and probe are not built to contest that. What we document is what is learnable by reinforcement when the model is denied geometric input at the moment of placing guards, and what part of that learning lives in the representation rather than the decoder. The contrast between the failed iterative editors (Section 6.4) and the single-shot probe is the cleanest expression of the point: removing the stop time and the rollout is exactly what lets the encoder’s information surface.

8 Limitations

We acknowledge several limitations of the work, enumerated here for clarity; the first two are conceptual and the rest experimental.

  1. 1.

    Scope of the “no geometric knowledge” claim. The geo-free constraint is on inference: the policy and the probe see no visibility object at test time. Training does see visibility: the PO/BT reward is computed from discrete visibility coverage, and the probe’s supervised targets are generated by local search using discrete visibility as a coverage gate. “No explicit geometric knowledge” must be read as “no geometric oracle at the moment of placing guards.”

  2. 2.

    The probe is supervised, not reinforcement. The SetPredictor is trained by BCE against local-search-derived labels. It is not itself a reinforcement learner. We position it as a representation probe of the RL-trained encoder, not as an RL contribution.

  3. 3.

    Held-out partition size. test has 362362 polygons. Wilson 95%95\% CIs on the coverage-feasibility proportion are reported, but the absolute count is modest. The OOD evaluation on ood (20812081 polygons) provides a much larger second held-out evidence base.

  4. 4.

    Policy selection and seeds. The PO/BT policy is the best of several reinforcement-pretraining runs, selected by training-set performance. Because selection used only the train split, test and ood play no role in it and the held-out evaluations are not optimistically biased by the choice; selecting on training reward rather than a held-out criterion is a weaker rule that, if anything, risks favoring an over-fit run, so we make no claim the selected policy is optimal. We do not average over policy seeds; seed-variance is characterized separately for the probe over four seeds {1234,11,22,33}\{1234,11,22,33\} (Table 3, Section 6.3).

  5. 5.

    LSTM, not Transformer, policy. We trialed a Transformer encoder–decoder policy of roughly 6×6\times the parameters; it reached a comparable coverage–cost operating point (greedy coverage 0.97\approx 0.97 at |S|/OPT1.15|S|/\mathrm{OPT}\approx 1.15 versus the LSTM pointer’s 1.21\approx 1.21 — sparser but no better on the trade-off), one training run failed to converge, and we found no improvement justifying the added capacity. We read this as overparametrization relative to the scalar RL signal rather than evidence against attention models in general — consistent with the fact that the probe, a small Transformer trained on dense per-vertex labels, does work; a careful attention-model study is future work.

  6. 6.

    Decode-time search and stronger decoders. We tested decode-time search of the policy seed by best-of-KK stochastic sampling with length-normalized scoring (Section 6.4, Table 9): geo-free likelihood-ranked search matched greedy and did not close the tail, so the residual tail is not a decoding artifact. We did not implement exhaustive beam search — the decoder loop is not exposed for per-step beam expansion, and best-of-KK sampling is the stronger search in practice — nor symmetry-exploiting RL such as POMO [21]; these act on the decoder and are orthogonal to the representational question of what the encoder has learned, so we leave them to future work.

  7. 7.

    Discrete visibility approximation in training. Both the PO/BT coverage reward and the SetPredictor training targets use discrete visibility sampling at M=500M=500 (Section 4.1). Exact coverage-area computation is used only at evaluation; the per-vertex visibility polygons are computed with CGAL in both training and evaluation. We did not sweep MM: the sample average has standard error O(1/M)O(1/\sqrt{M}), so a larger MM — or stratified sampling near reflex vertices, where the uncovered slivers are hardest to detect — would reduce target noise, but quantifying that sensitivity is left to future work.

  8. 8.

    Reward-hyperparameter sensitivity. The reward weights and PO settings (λ=1.0\lambda=1.0, ρ=3.0\rho=3.0, R=8R=8 rollouts, α=0.05\alpha=0.05; Section 5.2) were fixed to the values that produced the released checkpoint rather than swept. The design turns on the form of the reward — capping coverage at τ\tau and linearly penalizing the deficit (Section 4.1) — rather than on the precise scalars; a sensitivity study over them would require re-running PO/BT training and is left to future work.

  9. 9.

    Extreme OOD is strong but seed-dependent. On the ood-large split (Table 8, Section 6.3) the frozen-encoder probe lifts worst-case coverage on every seed while the no-encoder ablation cannot, but the per-seed feasibility count is variable, so we read it as evidence of the encoder signal rather than a feasibility guarantee at this scale; the probe also pays roughly double the local-search guard count, and the vertex-guard optimum is unavailable for the largest polygons there.

  10. 10.

    Training-curve reconstruction. Figure 1 is reconstructed from four saved checkpoints (epochs 110110, 114114, 160160, 200200), not from an online metric log; per-epoch coverage was not preserved during the original PO/BT run. The trace shows only late-training dynamics, which is sufficient to confirm that the gradient direction does not collapse but does not characterize early-training behavior.

  11. 11.

    Invariance and vertex ordering. The recurrent encoder reads vertices in file order, so the pipeline is invariant to translation and axis-aligned scaling by construction (the input is min–max normalized; Section 5.2) but not, in principle, to rotation, reflection, or vertex re-indexing. We measured the effect on test by re-running the frozen policy and probe (t=0.20t=0.20) on transformed polygons, re-decoding the seed each time: under rotation (4545^{\circ}, 9090^{\circ}, 180180^{\circ}) and mirroring the feasibility rate is unchanged (362/362362/362 at the 0.950.95 gate, mean coverage 0.998\geq 0.998), so in aggregate the learned features tolerate these transforms; a cyclic re-indexing of the vertices is slightly more disruptive (359/362359/362 feasible, with one small polygon falling to coverage 0.790.79), consistent with the order-dependence a recurrent encoder introduces. A permutation- or rotation-equivariant encoder (for instance, attention with relative positional encoding) would remove this dependence by construction and is a natural next step.

9 Conclusion

We asked what a neural policy internalizes about vertex-guard placement when trained from a coverage-aware reward over its own rollouts and denied any geometric oracle at inference, and we answered it by reading the encoder rather than the decoder: a single-shot probe over the frozen embeddings closes the in-distribution feasibility tail on the displayed seed and compresses the out-of-distribution tail by roughly an order of magnitude across four probe seeds, which we take as evidence that the reinforcement-trained encoder had captured more of the placement geometry than its decoder emitted as guards. The result is a measurement in a deliberately constrained setting, not a competitor to classical solvers on guard count — the learner is not the best AGP heuristic, but it arrived at a usable representation of vertex-guard placement without ever being shown the geometry it had to reason about at inference. Two extensions follow naturally: a stronger encoder, such as a Transformer trained under the same PO/BT objective, may shorten the residual tail directly and leave less work for the probe; and a per-instance threshold, predicted geo-free from the same embeddings but trained against coverage-selected targets exactly as the probe is trained against the visibility-derived labels SLSS^{\star}_{\mathrm{LS}}, would replace the global knob with an adaptive one — keeping visibility a training-time target, never an inference-time input. More broadly, the encoder-probing decomposition is not specific to vertex guarding: it should transfer to other geometric set-cover problems whose feasibility oracle is costly at inference — terrain guarding or range-limited watchtower placement, for instance — where geo-free inference stays meaningful precisely because the visibility or coverage oracle one would otherwise consult is expensive to evaluate. Whether a frozen encoder carries the requisite structure in those settings is an open question the present method is designed to ask.

Data and Code Availability

The AGPVG instance library is publicly available [9]. The 313313 supplemental random-class polygons with their ILP-computed vertex-guard optima, the dataset splits, the trained checkpoints, and all training and evaluation code are available at https://github.com/dseverdi/AGNet.

Conflict of Interest

The authors declare no competing interests.

References

  • [1] G. Alain and Y. Bengio (2017) Understanding intermediate layers using linear classifier probes. External Links: 1610.01644 Cited by: item C1, §3, §6.2.
  • [2] Y. Belinkov (2022) Probing classifiers: promises, shortcomings, and advances. Computational Linguistics 48 (1), pp. 207–219. Cited by: item C1, §3.
  • [3] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio (2016) Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940. Cited by: §1, §2.2, §3.
  • [4] B. Ben-Moshe, M. J. Katz, and J. S. B. Mitchell (2007) A constant-factor approximation algorithm for optimal 1.5D terrain guarding. SIAM Journal on Computing 36 (6), pp. 1631–1647. Cited by: §3.
  • [5] Y. Bengio, A. Lodi, and A. Prouvost (2021) Machine learning for combinatorial optimization: a methodological tour d’horizon. European Journal of Operational Research 290 (2), pp. 405–421. External Links: Document, ISSN 0377-2217, Link Cited by: §3.
  • [6] B. S. Bigham, S. Dolatikalan, and A. Khastan (2022) Minimum landmarks for robot localization in orthogonal environments. Evol. Intell. 15 (3), pp. 2235–2238. External Links: Link, Document Cited by: §1.
  • [7] R. A. Bradley and M. E. Terry (1952) Rank analysis of incomplete block designs: i. the method of paired comparisons. Biometrika 39 (3–4), pp. 324–345. Cited by: §1, §3.
  • [8] C. Caramanis, D. Fotakis, A. Kalavasis, V. Kontonis, and C. Tzamos (2023) Optimizing solution-samplers for combinatorial problems: the landscape of policy-gradient methods. Note: NeurIPS 2023 External Links: 2310.05309 Cited by: §3, §4.1.
  • [9] M. C. Couto, P. J. de Rezende, and C. C. de Souza (2009) Instances for the Art Gallery Problem. Note: http://www.ic.unicamp.br/˜cid/Problem-instances/Art-Gallery Cited by: §3, §5.1, Data and Code Availability.
  • [10] M. C. Couto, P. J. de Rezende, and C. C. de Souza (2011) An exact algorithm for minimizing vertex guards on art galleries. International Transactions in Operational Research 18 (4), pp. 425–448. Cited by: §3, §5.1.
  • [11] M. Deudon, P. Cournut, A. Lacoste, Y. Adulyasak, and L. Rousseau (2018) Learning heuristics for the TSP by policy gradient. In Integration of Constraint Programming, Artificial Intelligence, and Operations Research (CPAIOR), Lecture Notes in Computer Science, Vol. 10848, pp. 170–181. Cited by: §3.
  • [12] K. Elbassioni, D. Matijević, J. Mestre, and D. Ševerdija (2008) Improved approximations for guarding 1.5-dimensional terrains. CoRR abs/0809.0159v1. Cited by: §3.
  • [13] A. Elnagar and L. Lulu (2005) An art gallery-based approach to autonomous robot motion planning in global environments. In 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vol. , pp. 2079–2084. External Links: Document Cited by: §1.
  • [14] S. K. Ghosh (2010) Approximation algorithms for art gallery problems in polygons. Discrete Applied Mathematics 158 (6), pp. 718–722. Cited by: §3.
  • [15] H. González-Baños (2001) A randomized art-gallery algorithm for sensor placement. Proceedings of the 17th Annual Symposium on Computational Geometry (SoCG), pp. 232–240. Cited by: §3.
  • [16] J. Hewitt and P. Liang (2019) Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2733–2743. Cited by: item C1, §3, §6.2.
  • [17] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §4.1.
  • [18] Y. Jin, Y. Ding, X. Pan, K. He, L. Zhao, T. Qin, L. Song, and J. Bian (2023) Pointerformer: deep reinforced multi-pointer transformer for the traveling salesman problem. Note: AAAI 2023 External Links: 2304.09407 Cited by: §3, §4.1.
  • [19] M. D. Kaba, M. G. Uzunbas, and S. N. Lim (2016) A reinforcement learning approach to the view planning problem. External Links: 1610.06204 Cited by: §3.
  • [20] W. Kool, H. van Hoof, and M. Welling (2019) Attention, learn to solve routing problems!. In International Conference on Learning Representations (ICLR), Cited by: §2.2, §3.
  • [21] Y. Kwon, J. Choo, B. Kim, I. Yoon, Y. Gwon, and S. Min (2020) POMO: policy optimization with multiple optima for reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 33. Cited by: §2.2, §3, item 6.
  • [22] D. T. Lee and A. K. Lin (1986) Computational complexity of art gallery problems. IEEE Transactions on Information Theory 32 (2), pp. 276–282. Cited by: §1, §3.
  • [23] K. Li, T. Zhang, and R. Wang (2021) Deep reinforcement learning for multiobjective optimization. IEEE Transactions on Cybernetics 51 (6), pp. 3103–3114. Cited by: §3.
  • [24] Y. Liao, P. Chang, and H. Li (2025) Reinforcement learning for optimize coverage in art gallery problem using Q-learning based in grid world. IEEE Access 13, pp. 52711–52724. Cited by: §3.
  • [25] I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), Cited by: §5.2.
  • [26] I. Mehta, S. Taghipour, and S. Saeedi (2022) Pareto frontier approximation network (PA-Net) to solve bi-objective TSP. External Links: 2203.01298 Cited by: §3.
  • [27] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher (1978) An analysis of approximations for maximizing submodular set functions—i. Mathematical Programming 14 (1), pp. 265–294. Cited by: §3.
  • [28] J. O’Rourke (1987) Art gallery theorems and algorithms. Oxford University Press. Cited by: §1, §3.
  • [29] M. Pan, G. Lin, Y. Luo, B. Zhu, Z. Dai, L. Sun, and C. Yuan (2025) Preference optimization for combinatorial optimization problems. In Proceedings of the 42nd International Conference on Machine Learning (ICML), Note: arXiv:2505.08735 Cited by: §1, §3, §4.1, §4.2.
  • [30] S. Ross, G. J. Gordon, and J. A. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 627–635. Cited by: §3, §6.4.
  • [31] The CGAL Project (2024) CGAL user and reference manual. 5.6 edition, CGAL Editorial Board. Note: https://doc.cgal.org/5.6/Manual/packages.html Cited by: §4.1, §5.3.
  • [32] O. Vinyals, M. Fortunato, and N. Jaitly (2015) Pointer networks. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 28. Cited by: §2.2, §3, §4.1.
  • [33] R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3-4), pp. 229–256. Cited by: §1, §3.