An exact parser from natural language math into logical forms



May 7, 2024 (last updated May 29, 2024)

The task considered in this note is the following. Given a natural language math sentence, e.g.:

“Let $ABC$ be an acute triangle with circumcenter $O$. […] Prove that line $AO$ meets $BC$ at $X$.”

parse it into logical forms, e.g.:

$ABC$ <- "triangle"().
"acute"($ABC$).
$O$ <- "circumcenter"($ABC$).
-- [...]
$AO$ <- "line"().
prove ( "meet"($AO$, $BC$) -> $X$ ).

The logical forms corresponding to the natural language math can then be fed as input into, e.g., a forward-chaining prover such as Meringue. (The task of parsing natural language into logical forms has also been called semantic parsing [L16], but we simply call it parsing, hopefully without loss of clarity.)

Note that, depending on the complexity or size of the logical form domain, LLMs of various sizes can already do this task to an extent, either few-shot as a general base or instruction-tuned model, or with task-specific fine-tuning.

As to why one might not use an LLM for this task: the answer gets at the current limitations of LLMs for reasoning in general.

First, let us note that LLMs may be considered artificial general intelligence in the sense that they have generalization capabilities and general knowledge across many tasks and domains. Across that space of tasks and domains, LLMs can be said to possess super-human recall (i.e. a lower proportion of false negatives). However, it is also empirically observed that LLMs are often wrong (e.g. hallucination, faulty reasoning). In this latter sense, LLMs can also be said to possess sub-par precision (i.e. more false positives than desired or expected).

Thus, an alternative to LLMs might instead prioritize precision (fewer false positives) at the cost of recall (i.e. a narrower domain).

A second consideration relates to the inference throughput of LLMs. Single-GPU inference throughput of LLMs is typically around 100-1000 token/s, or 1-10 ms/token (to the nearest order of magnitude) for a single stream of tokens; batched inference can bring the number to around 1000-10000 token/s, or 0.1-1 ms/token [N24].

With the above considerations in mind, a prototype exact parser from natural language mathematics into logical forms was implemented by hand. The prototype implementation is on GitHub.

The “minimum description length” of the parser is rather long, and so a pseudocode description of the algorithm does not really elucidate how it works. However, some high level remarks may be informative.

Cooperative backtracking parser architecture



The parsing algorithm itself is best thought of as a handful of cooperating phases that operate on a shared backtracking stack. During a parse, the cooperating phases alternate, each phase committing an operation (or choice point) on the shared backtracking stack. A parse failure arising in any phase causes backtracking of the stack in depth-first-search order.

In the current version of the parser, the cooperating phases are: (1) lexical segmentation, (2) lexical tree adjoining, and (3) syntactic tree building. (The algorithm has a strong resemblance to Pratt parsing [P73], which was explicitly an inspiration.)
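To make the shared-stack mechanism concrete, here is a minimal Python sketch of a backtracking stack with choice points. The names (ChoicePoint, commit, backtrack) and the list-based representation are illustrative assumptions, not the prototype's actual API.

class ParseFailure(Exception):
    pass

class ChoicePoint:
    def __init__(self, phase, alternatives):
        self.phase = phase                # which phase committed this choice
        self.alternatives = alternatives  # untried alternatives, most likely first

class BacktrackingStack:
    def __init__(self):
        self.stack = []

    def commit(self, phase, alternatives):
        # A phase commits its most likely alternative; the remaining
        # alternatives become a choice point for later backtracking.
        assert alternatives, "a phase with no alternatives should fail instead"
        self.stack.append(ChoicePoint(phase, list(alternatives[1:])))
        return alternatives[0]

    def backtrack(self):
        # On failure, pop back to the most recent choice point with an
        # untried alternative, i.e. depth-first-search order.
        while self.stack:
            cp = self.stack.pop()
            if cp.alternatives:
                self.stack.append(cp)
                return cp.alternatives.pop(0)
        raise ParseFailure("exhausted all choice points")

Because all phases share one stack, a failure in one phase can unwind choices committed by another; this is what makes the phases cooperative.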

Lexical segmentation looks ahead in the input text character stream and determines the next lexical item (essentially a token). Lexical segmentation also assigns the lexical item a logical kind, which is analogous to a “part-of-speech.”
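For illustration, segmenting a prefix of the running example might produce lexical items tagged with logical kinds like the following. The kind names are invented for exposition; the prototype's actual inventory differs.

# Hypothetical result of lexical segmentation on the example sentence;
# each lexical item is paired with a logical kind ("part-of-speech").
segmented = [
    ("Let",          "LET"),
    ("$ABC$",        "VAR"),
    ("be",           "COPULA"),
    ("an",           "ARTICLE"),
    ("acute",        "ADJECTIVE"),
    ("triangle",     "NOUN"),
    ("with",         "PREPOSITION"),
    ("circumcenter", "NOUN"),
    ("$O$",          "VAR"),
    (".",            "PERIOD"),
]

When multiple prefixes of the remaining input are plausible items, or one item admits multiple kinds, the alternatives become a choice point on the shared stack.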

Lexical tree adjoining takes the current lexical item and determines which earlier lexical item is a suitable destination as either a parent or a left-sibling in the lexical tree. There are often multiple possible adjoining destinations, in which case they should be ordered from most likely to least.
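A sketch of the adjoining step, under the assumption that lexical items are (text, kind) pairs as above. The kind-pair tables are invented; they also play the role of the pairwise comparisons of logical kinds described further on, which allow failing early.

# Hypothetical kind-pair tables governing which earlier items may serve
# as a parent or left-sibling of the current item.
PARENT_OK  = {("COPULA", "NOUN"), ("PREPOSITION", "NOUN"), ("NOUN", "VAR")}
SIBLING_OK = {("ARTICLE", "ADJECTIVE"), ("ADJECTIVE", "NOUN")}

def adjoin_candidates(current, earlier):
    # Return candidate destinations ordered from most likely to least;
    # the shared backtracking stack tries them in this order.
    candidates = []
    for dest in reversed(earlier):          # prefer nearby items first
        if (dest[1], current[1]) in PARENT_OK:
            candidates.append(("parent", dest))
        if (dest[1], current[1]) in SIBLING_OK:
            candidates.append(("left-sibling", dest))
    return candidates

For example, adjoin_candidates(("triangle", "NOUN"), segmented[:5]) would first propose “acute” as a left-sibling destination, then the copula “be” as a parent.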

Syntactic tree building transforms the lexical tree into a syntactic tree (i.e. an AST of the logical form language). This is done bottom-up and incrementally upon the adjoining of the current lexical item. This phase has the most complex implementation, though it looks like just a bunch of nested lookup tables.
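A sketch of what one of the lookup tables might look like, keyed on the logical kind of a lexical item and the kinds of its children. The keys, the tuple-based AST encoding, and the helper names are all invented for exposition.

class ParseFailure(Exception):  # as in the earlier sketch
    pass

# Hypothetical lookup table (flattened into tuple keys) mapping a
# lexical-tree fragment to a logical-form AST node; compare the logical
# forms at the top of this note.
BUILD_TABLE = {
    # "Let $ABC$ be ... triangle" builds:  $ABC$ <- "triangle"().
    ("COPULA", ("VAR", "NOUN")):
        lambda item, var, noun: ("bind", var, (noun, ())),
    # "acute triangle": the adjective's predicate applies to the noun's
    # referent, building:  "acute"($ABC$).
    ("ADJECTIVE", ("NOUN",)):
        lambda item, noun: ("pred", item[0], noun),
}

def build_node(item, child_kinds, children):
    # A missing case triggers backtracking; whether the case is merely
    # unimplemented or outright erroneous is decided manually.
    entry = BUILD_TABLE.get((item[1], child_kinds))
    if entry is None:
        raise ParseFailure((item, child_kinds))
    return entry(item, *children)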

All three cooperating phases can independently fail and backtrack using the shared backtracking stack.

During lexical segmentation, the next lexical item may be segmented too long, when instead a shorter string prefix of the item should have been segmented. The logical kind assigned to the item may also cause failure.

During lexical tree adjoining, if the current lexical item were adjoined to the wrong parent or left-sibling, then the next possible adjoining destination would be tried after backtracking. Pairwise comparisons of the lexical items’ logical kinds are sometimes sufficient to catch what would be an erroneous parse, and to instead fail early.

Finally, during syntactic tree building, a typical failure is a missing case in one of the lookup tables. The missing case could simply be an implementation gap, fixed manually by adding the relevant case-handling code, but it could also be an outright erroneous case, for which failure (and backtracking) is the correct behavior.

Parser improvement as imitation learning



The strategy by which the parser was written was essentially a form of test-driven development, in which test coverage was measured on a dataset of transcribed IMO shortlist geometry problems. Thus, the parser was incrementally implemented by making small modifications and refactorings to parse new test cases in the dataset. However, we can formulate the process of parser improvement much more rigorously.

Let us consider the process of parser improvement as a learning process. We are given a training dataset consisting of pairs (String, () -> Tree). The first element of a training pair is a natural language math sentence, e.g. from a corpus of olympiad math problems. The second element is a lazy label and is a closure returning the logical form abstract syntax tree (AST) corresponding to the first natural language math sentence. In other words, the learning process is an instance of lazy supervised learning; while training examples need labels, they do not need to be pre-labeled up front.
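As a sketch, such a lazily-labeled training pair might look like the following in Python, with the label memoized so the (possibly expensive, human-supplied) evaluation happens at most once. The tuple-based tree encoding is a stand-in.

def lazy_label(derive_tree):
    # Wrap an expensive labeling closure so it is evaluated on demand
    # and at most once (e.g. when it requires the human expert).
    cache = []
    def force():
        if not cache:
            cache.append(derive_tree())
        return cache[0]
    return force

# A training pair (String, () -> Tree).
sample = ("Let $ABC$ be an acute triangle with circumcenter $O$.",
          lazy_label(lambda: ("bind", "$ABC$", ("triangle", ()))))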

Formally, a parser is a function:

Parse : String -> Success(Tree)|Failure(ParseState)

where Success(_)|Failure(_) is the notation for a sum type (a.k.a. a tagged union or enum).
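For concreteness, one way to render this type in Python is the following sketch, where Tree and ParseState stand in for the prototype's actual data structures.

from dataclasses import dataclass
from typing import Union

@dataclass
class Success:
    tree: "Tree"          # the logical-form AST

@dataclass
class Failure:
    state: "ParseState"   # the backtracking history, inspectable post hoc

ParseResult = Union[Success, Failure]
# Parse : str -> ParseResult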

Now, suppose that the current parser produces a failed parse, i.e. Failure(ParseState). We can inspect the ParseState to learn more about the parse failure and how to possibly modify the parser to avoid failure. Because our parsing architecture is based on a shared backtracking stack, it is possible to inspect the backtracking history and to find the longest prefix of the input string that yielded an incremental parse before encountering parse failure. More generally, the tree-structured backtracking history of the ParseState allows one to observe the sequence of actual parsing choice points, and to guess at which counterfactual parsing choices could have led to parsing success.

The availability of the longest prefix also allows one to define an ordering on the training dataset. For each training sample that does not yet successfully parse, calculate the longest prefix among its incremental parses, and take the ratio of the longest prefix length to the input string length. Sorting the training samples by decreasing ratio, one attains a lowest-hanging-fruit ordering, where the training samples at the front are closest to being successfully parsed. Thus, efforts to improve the parser should focus on those training samples first.
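A sketch of this ordering computation, reusing the Failure record from the sketch above and assuming a hypothetical longest_prefix_len field on ParseState:

def lowest_hanging_fruit(samples, parse):
    # samples: (sentence, lazy label) pairs; parse: the current parser.
    ranked = []
    for x, lazy_t in samples:
        result = parse(x)
        if isinstance(result, Failure):
            # Hypothetical field: length of the longest prefix of x that
            # yielded an incremental parse before failure.
            ratio = result.state.longest_prefix_len / len(x)
            ranked.append((ratio, x, lazy_t))
    # Sort by decreasing ratio: the front is closest to parsing.
    ranked.sort(key=lambda r: r[0], reverse=True)
    return ranked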

The reasoning above leads us to the following meta-algorithm for parser improvement:

initialize regression set = empty set
while training is not done:
    choose a training sample x by the lowest-hanging-fruit ordering
    evaluate the lazy label t of x
    -- i.e., t is the target logical form tree
    run the current parser on x to get a parse y
    if y is a successful parse:
        if y is compatible with t:
            add (x, y) to the regression set
            continue
    while True:
        modify the parser in a small but promising direction
        run the modified parser on x to get a new parse y'
        if y' is a failed parse:
            continue
        if y' is not compatible with t:
            continue
        run the modified parser on the regression set
        if there are any parsing regressions:
            continue
        break
    add (x, y') to the regression set

Let us briefly analyze the parse improvement meta-algorithm.

1. Parser improvement is formally related to imitation learning algorithms, e.g. DAgger [R11].

2. As parser improvement involves running the parser on the entire regression set, its feasibility requires a fast parser.

3. While the initial implementation of parser improvement relies on a human expert to evaluate target logical form tree labels and to modify the parser, in theory both tasks could be partially assisted (or even entirely replaced) by an AI system (e.g. one based on LLMs), enabling parser improvement to scale up.

Results on an IMO dataset



The actual training dataset consists of 135 IMO shortlist geometry problems from 2006 to 2022, from which we extracted 578 English (LaTeX) sentences of natural language mathematics. Target logical forms corresponding to those sentences were supplied, lazily, by the author. In practice, fresh parses from the parser itself were also accepted in lieu of the author explicitly deriving a target logical form tree from scratch; this could be considered doubly lazy supervision.

In its current state, the parser, Praline, produces 121 gold parses out of 578 total samples, for a training recall of 21%. The training precision is, roughly, 100%, with the caveat that the human expert may not have perfectly labeled every training sample.

Typical “inference throughput” of the parser is about 0.1-1 ms/sentence, or equivalently around 0.01-0.1 ms/token, on a single CPU core (to the nearest order of magnitude). Extrapolating to 100x cores, batched throughput is expected to be 0.0001-0.001 ms/token.

What about the performance impact of backtracking? Among the 121 gold parses, the median backtrack count is 45, the mean 56, the minimum 20, and the maximum 165. These numbers suggest that even simple depth-first backtracking is not too expensive on natural language sentences.

The bulk of the development of the parser occurred over an approximately 8-week period. The first half was spent iterating on the parser architecture and practicing the parser improvement process on each iteration. During the second half, the parsing architecture was frozen on the cooperating backtracking architecture described earlier, and parser improvement proceeded in earnest.

Finally, why did parser improvement stop at 21% recall? Well, that was simply an arbitrary stopping point. Looking at the actual data, I can see a continued handwritten implementation of the parser reaching over 90% recall before it would make sense to focus on other datasets, rather than chase the long tail of “unique” training samples in the current dataset. However, the next implementation of the parser will likely incorporate lessons learned, with an eye toward automated parser improvement.

Acknowledgments



We thank David Shin for helpful discussions and feedback.

References



[L16] Percy Liang. “Learning Executable Semantic Parsers for Natural Language Understanding.” Communications of the ACM 59(9), pp. 68-76 (2016).

[N24] Nvidia. “H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM.” URL: https://nvidia.github.io/TensorRT-LLM/blogs/H200launch.html (2024).

[P73] Vaughan R. Pratt. “Top down operator precedence.” In Proceedings of the 1st Annual ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages (1973).

[R11] Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning.” In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS 2011).