An exploratory improvement operator for a forward-chaining prover
May 7, 2024 (last updated May 22, 2024)
Background: forward-chaining theorem proving with choice
The following is a basic sketch of a forward-chaining theorem prover with choice:
in the primary logical context, given:
    a set of logical rules
    a set of logical propositions
    a set of logical goals
while True:
    compute the fixed-point saturation of the non-choice logical rules
    update the primary logical context with the fixed-point computation
    for every logical rule choice:
        create a working copy of the primary logical context
        in the working logical context:
            evaluate the effect of applying the choice
            record the evaluated effect
    select the logical rule choice that has the best effect
    apply the selected choice (in the primary logical context)
    if a logical contradiction was reached:
        raise a logical exception
    if all logical goals are satisfied:
        break
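The sketch above can be made concrete on a toy logic where a context is a set of facts and each rule is a function from a context to new facts. The following Python sketch is illustrative only; names like Rule, saturate, and the scoring function are hypothetical and not Meringue's actual API, and contradiction handling is elided for brevity.

```python
# A minimal, self-contained sketch of forward chaining with choice, on a toy
# logic: a context is a set of facts, and a rule maps a context to new facts.
# All names here are illustrative, not Meringue's real interfaces.

class LogicError(Exception):
    pass

class Rule:
    def __init__(self, fire, is_choice=False):
        self.fire = fire          # context (set of facts) -> set of new facts
        self.is_choice = is_choice

def saturate(ctx, rules):
    """Fixed-point saturation of the non-choice rules."""
    while True:
        new = set()
        for r in rules:
            if not r.is_choice:
                new |= r.fire(ctx) - ctx
        if not new:
            return ctx
        ctx = ctx | new

def forward_chain(ctx, rules, goals, score):
    while True:
        ctx = saturate(ctx, rules)
        if goals <= ctx:
            return ctx
        # Evaluate each choice on a working copy of the context.
        scored = []
        for r in rules:
            if r.is_choice:
                work = saturate(ctx | r.fire(ctx), rules)
                scored.append((score(work, ctx), r))
        if not scored:
            raise LogicError("no applicable choice")
        # Apply the best-scoring choice in the primary context.
        best = max(scored, key=lambda t: t[0])[1]
        ctx = ctx | best.fire(ctx)

# Toy problem: derive "c" from "a"; a choice rule supplies "b".
rules = [
    Rule(lambda ctx: {"c"} if {"a", "b"} <= ctx else set()),
    Rule(lambda ctx: {"b"} if "a" in ctx else set(), is_choice=True),
]
final = forward_chain({"a"}, rules, goals={"c"},
                      score=lambda work, ctx: len(work - ctx))
print(sorted(final))  # → ['a', 'b', 'c']
```

The toy scoring function (count of fresh facts) is a placeholder; the visit-count-based evaluation described below is the actual subject of this note.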
Above, by non-choice and choice logical rules, we refer to the previous note on logical languages.
In the literature on forward-chaining theorem provers for geometry solving, the above forward-chaining sketch is often called a deductive database, where the choices are restricted to logical rules that produce geometric constructions [CG00].
For our purposes, the key design decision is how to evaluate the effect of making a logical choice. The next section will describe such a design. (For a practical implementation, please see Meringue.)
Logical inference, partial evaluation, and intrinsic visit counts
We proceed with an example. Consider two logical rules as follows:
axm for x, y { a(x), b(x, y), c(y) -: e(x, y) }
axm for x, y { a(x), d(x, y), c(y) -: f(x, y) }
The two primal logical rules above induce several auxiliary logical rules that should be added to the full set of logical rules evaluated during fixed-point saturation. One set of auxiliary rules are essentially identities:
axm for x { a(x) -: a'(x) }
axm for x { c(x) -: c'(x) }
axm for x, y { e(x, y) -: e'(x, y) }
axm for x, y { f(x, y) -: f'(x, y) }
axm for x, y { a(x), b(x, y), c(y) -: abc'(x, y) }
axm for x, y { a(x), d(x, y), c(y) -: adc'(x, y) }
where relations a', etc. are automatically defined as auxiliary relations.
Another set of auxiliary rules consist of all the non-trivial partially evaluated joins of the primal logical rules:
axm for x, y { a(x), c(y) -: ac'(x, y) }
axm for x, y { a(x), b(x, y) -: ab'(x, y) }
axm for x, y { c(x), b(x, y) -: cb'(x, y) }
axm for x, y { a(x), d(x, y) -: ad'(x, y) }
axm for x, y { c(x), d(x, y) -: cd'(x, y) }
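The partial joins above can be enumerated mechanically from a rule body. The following sketch assumes one particular reading of the construction, namely that the partial joins are the proper subsets of the body with at least two atoms; the atom and relation encodings are hypothetical.

```python
# Enumerate the non-trivial partially evaluated joins of a rule body.
# A body is a list of (relation, vars) atoms; each proper subset with at
# least two atoms yields one auxiliary relation (assumed naming scheme).

from itertools import combinations

def partial_joins(body):
    """Yield (auxiliary relation name, atom subset) for each partial join."""
    for k in range(2, len(body)):          # proper subsets only
        for subset in combinations(body, k):
            name = "".join(rel for rel, _ in subset) + "'"
            yield name, list(subset)

# Body of the first primal rule: a(x), b(x, y), c(y).
body = [("a", ("x",)), ("b", ("x", "y")), ("c", ("y",))]
for name, atoms in partial_joins(body):
    print(name, "<-", atoms)
# → ab' , ac' , bc'  (up to atom ordering within each subset)
```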
For logical rules that produce functional tuples or (in)equalities (i.e. unification), there are corresponding auxiliary logical rules that are triggered by the relevant unifications.
These auxiliary logical rules then directly induce an intrinsically motivated metric of the progress made on partial rule evaluation during logical inference. Specifically, when evaluating a logical rule choice, each produced logical tuple corresponding to an auxiliary relation (except for the tuple corresponding to the choice itself, on which the evaluation is conditioned) then increments a visit count specific to that auxiliary relation. Logical tuples are produced during fixed-point saturation (including both the primal and auxiliary logical rules), and the evaluation with choice may be done recursively.
The selection phase looks at the minimum of all auxiliary visit counts that were updated during the last evaluation, and selects the choice with least min-updated-visit-count. (Choices that do not cause any visit count updates are considered to have infinite min-updated-visit-count.) In practice, the min-updated-visit-count selection rule is augmented with a noise mechanism, akin to epsilon-greedy policies used in reinforcement learning.
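The selection rule just described can be sketched as follows. Here visit_counts, updated, and the epsilon parameter are illustrative stand-ins for the prover's real bookkeeping; the epsilon-greedy noise is one possible instantiation of the noise mechanism mentioned above.

```python
# A sketch of min-updated-visit-count selection with epsilon-greedy noise.
# `visit_counts` maps auxiliary relations to their visit counts;
# `updated[choice]` is the set of auxiliary relations whose counts were
# incremented while evaluating that choice. All names are illustrative.

import math
import random

def select_choice(choices, updated, visit_counts, epsilon=0.1, rng=random):
    if rng.random() < epsilon:
        return rng.choice(choices)       # exploration noise
    def key(choice):
        rels = updated.get(choice, ())
        # Choices that update no visit counts score infinity.
        return min((visit_counts[r] for r in rels), default=math.inf)
    return min(choices, key=key)

visit_counts = {"ab'": 5, "ac'": 1, "cb'": 7}
updated = {"c1": {"ab'", "cb'"}, "c2": {"ac'"}, "c3": set()}
print(select_choice(["c1", "c2", "c3"], updated, visit_counts,
                    rng=random.Random(0)))  # → c2
```

With the greedy branch taken, c2 wins because its least-visited updated relation (ac', count 1) is smaller than c1's (count 5), and c3 updates nothing and so scores infinity.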
At a higher level, the use of auxiliary visit counts induced by auxiliary relations and auxiliary logical rules may be more simply seen as a sufficient implementation of an exploration-based improvement operator for a general forward-chaining prover. That is, the improvement operator takes the state of the forward-chaining prover (i.e. its logical context), and returns a new prover state with fresh logical tuples produced by inference of logical rules, all in bounded space and time.
Implementation details
Let us briefly discuss a couple of implementation details.
One is how to implement the (parallel) evaluation of choices. Two possible approaches here are copy-on-write and reversible data structures/algorithms. Meringue uses a copy-on-write prover context dispatched per thread-worker, which greatly simplifies the initial prototyping. As the copy-on-write implementation currently uses atomics and persistent data structures, an alternative implementation based on reversibility (e.g. via an undo operation implemented on the prover context, with each thread-worker maintaining a thread-local context with mutable data structures and a rollback log) is expected to be more performant. However, reversibility is a very “leaky abstraction,” so its implementation is not to be taken lightly.
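The reversibility alternative can be sketched in a few lines. The Context class and its mark/undo methods below are hypothetical, a minimal illustration of the rollback-log idea rather than Meringue's implementation.

```python
# A sketch of a prover context with a mutable fact set and a rollback log:
# a thread-worker marks the log, evaluates a choice in place, then undoes
# the speculative facts instead of copying the whole context.

class Context:
    def __init__(self):
        self.facts = set()
        self.log = []            # facts added, in insertion order

    def add(self, fact):
        if fact not in self.facts:
            self.facts.add(fact)
            self.log.append(fact)

    def mark(self):
        return len(self.log)

    def undo(self, mark):
        """Roll back every fact added since `mark`."""
        while len(self.log) > mark:
            self.facts.discard(self.log.pop())

ctx = Context()
ctx.add("a(1)")
m = ctx.mark()
ctx.add("e(1, 2)")           # speculative facts from evaluating a choice
ctx.add("abc'(1, 2)")
ctx.undo(m)                  # roll back the evaluation
print(sorted(ctx.facts))     # → ['a(1)']
```

The leakiness alluded to above shows up once the context holds more than a fact set, e.g. indexes and unification state, all of which must participate in the rollback log.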
A second is how to implement the evaluation of logical rules. Meringue evaluates one-logical-rule-at-a-time, and implements semi-naive evaluation with worst-case-optimal joins [NP12, V12] to evaluate individual logical rules; no geometry-specific logic is involved in the evaluation implementation. There is potentially some low-hanging fruit for speeding up general forward-chaining evaluation.
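For readers unfamiliar with semi-naive evaluation, here is a generic sketch on the classic transitive-closure rule; this illustrates the delta-driven technique in general and is not Meringue's join implementation (which uses worst-case-optimal joins).

```python
# Semi-naive evaluation of one rule, transitive closure:
#   path(x, z) :- path(x, y), edge(y, z).
# Each round joins only the delta (tuples new in the previous round)
# against the edge relation, instead of re-joining the full path relation.

def transitive_closure(edges):
    path = set(edges)
    delta = set(edges)
    while delta:
        new = {(x, z) for (x, y) in delta for (y2, z) in edges if y == y2}
        delta = new - path       # keep only genuinely new tuples
        path |= delta
    return path

edges = {(1, 2), (2, 3), (3, 4)}
print(sorted(transitive_closure(edges)))
# → [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```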
Forward- v. backward-chaining
While the discussion in this note is focused on forward-chaining (i.e. fixed-point-based) theorem provers, one may also consider how backward-chaining (i.e. backtracking/depth-first-search-based) provers fit into the overall picture. (Prolog is the classic example of a backward-chaining prover.)
The “inner loop” of forward-chaining is usually a join computation implemented by a backtracking/DFS-based algorithm very similar to that of backward-chaining. However, there are two differences. First, forward-chaining then wraps an embarrassingly parallel “outer loop” around many independent inner-loop joins. (The analog of “outer loop” parallelization for backward-chaining is called or-parallelism, which considers disjunctive hypothetical branches in parallel; however, forward-chaining is expected to be easier to parallelize than backward-chaining.) Second, the backtracking/DFS implementation of a worst-case-optimal join, as used in forward-chaining, will make different tradeoffs and optimizations compared to the implementation as used in backward-chaining.
Backward-chaining is useful for goal and subgoal decomposition via the backtracking/DFS stack order, which is formally complementary to the forward-chaining approach that takes goals/subgoals as given. Hybrid forward/backward-chaining approaches (that use backward-chaining to construct goals and subgoals, and forward-chaining to search from assumptions toward goals/subgoals) are an underexplored subject.
Comparison with AlphaGeometry
AlphaGeometry [TW24] consists of (1) a symbolic geometry solver, DD+AR, and (2) a transformer-based language model for proposing intermediate geometric constructions.
The DD+AR solver (DD = “deductive database,” AR = “algebraic reasoning”) is essentially a forward-chaining prover augmented with special geometry domain-specific deduction rules. As a result, the fixed-point saturation computed by DD+AR should be larger than the fixed-point computed by a more generic forward-chaining prover, such as Meringue.
However, DD+AR on its own is not equipped with a method for choosing intermediate geometric constructions; DD+AR usually must be paired with the geometry-tuned language model. Thus, on many olympiad-level problems, DD+AR alone may halt with failure because a successful proof requires an intermediate construction that DD+AR alone cannot supply. In other words, DD+AR alone may be said to be incomplete.
On the other hand, the exploration-based improvement described earlier does serve as a general algorithmic method that may also be applied to choosing intermediate geometry constructions. Forward-chaining with exploration-based improvement will not halt until the proof goal is found, or an externally imposed timeout is reached (or a logical contradiction is uncovered; but such a scenario requires support from the logic); in other words, forward-chaining with exploration-based improvement is complete.
One such example was found: forward-chaining with exploration successfully finds a proof of a hard reduction of an olympiad-level geometry problem, while DD+AR alone (without the language model) halts early and thus fails; please see the discussion of IMO 2009 P2 in the README of Meringue.
In addition, the AlphaGeometry transformer-based model that proposes geometric constructions was trained on a large (100 million samples) synthetically generated formal dataset of olympiad-style geometry problems and their corresponding proofs. This is in stark contrast to the exploration-based improvement operator, which is a purely algorithmic method and is not based on pretraining.
Acknowledgments
We thank David Shin for helpful discussions and feedback.
References
[CG00] Shang-Ching Chou, Xiao-Shan Gao, and Jing-Zhong Zhang. “A Deductive Database Approach to Automated Geometry Theorem Proving and Discovering.” Journal of Automated Reasoning 25, pp. 219-246 (2000).
[NP12] Hung Q. Ngo, Ely Porat, Christopher Ré, and Atri Rudra. “Worst-case Optimal Join Algorithms.” arXiv preprint 1203.1952v1 (2012).
[TW24] Trieu H. Trinh, Yuhuai Wu, Quoc V. Le, He He, and Thang Luong. “Solving olympiad geometry without human demonstrations.” Nature 625, pp. 476-482 (2024).
[V12] Todd L. Veldhuizen. “Leapfrog Triejoin: A Simple, Worst-Case Optimal Join Algorithm.” arXiv preprint 1210.0481v5 (2012).