| # Error recovery (2022-05-07) |
| |
| Problem: we have two fairly generic cases of recovery bounded within a range: |
| - sequences: `int x; this is absolute garbage; x++;` |
| - brackets: `void foo() { this is garbage too }` |
| |
| So far, we've thought of these as different, and had precise ideas about |
| brackets ("lazy parsing") and vague ones about sequences. |
| Con we unify them instead? |
| |
| In both cases we want to recognize the bounds of the garbage item based on |
| basic token-level features of the surrounding code, and avoid any interference |
| with the surrounding code. |
| |
| ## Brackets |
| |
| Consider a rule like `compound-stmt := { stmt-seq }`. |
| |
| The desired recovery is: |
| - if we fail at `{ . stmt-seq }` |
| - ... and we can find for the matching `}` |
| - then consume up to that token as an opaque broken `stmt-seq` |
| - ... and advance state to `{ stmt-seq . }` |
| |
| We can annotate the rule to describe this: `{ stmt-seq [recovery] }`. |
| We can generalize as `{ stmt-seq [recovery=rbrace] }`, allowing for different |
| **strategies** to find the resume point. |
| |
| (It's OK to assume we're skipping over one RHS nonterminal, we can always |
| introduce a nonterminal for the bracket contents if necessary). |
| |
| ## Sequences |
| |
| Can we apply the same technique to sequences? |
| Simplest case: delimited right-recursive sequence. |
| |
| ``` |
| param-list := param |
| param-list := param , param-list |
| ``` |
| |
| We need recovery in **both** rules. |
| `param` in the first rule matches the *last* param in a list, |
| in the second rule it matches all others. We want to recover in any position. |
| |
| If we want to be able to recovery `int x int y` as two parameters, then we |
| should extract a `param-and-comma` rule that recovery could operate on. |
| |
| ### Last vs not-last elements |
| |
| Sequences will usually have two rules with recovery, we discussed: |
| - how to pick the correct one to recover with |
| - in a left-recursive rule they correspond to last & not-last elements |
| - the recovery strategy knows whether we're recoverying last or not-last |
| - we could have the strategy return (pos, symbol parsed), and artificially |
| require distinct symbols (e.g. `stmt` vs `last_stmt`). |
| - we can rewrite left-recursion in the grammar as right-recursion |
| |
| However, on reflection I think we can simply allow recovery according to both |
| rules. The "wrong" recovery will produce a parse head that dies. |
| |
| ## How recovery fits into GLR |
| |
| Recovery should kick in at the point where we would otherwise abandon all |
| variants of an high-level parse. |
| |
| e.g. Suppose we're parsing `static_cast<foo bar baz>(1)` and are up to `bar`. |
| Our GSS looks something like: |
| |
| ``` |
| "the static cast's type starts at foo" |
| ---> {expr := static_cast < . type > ( expr )} |
| | "foo... is a class name" |
| +---- {type := identifier .} |
| | "foo... is a template ID" |
| +---- {type := identifier . < template-arg-list >} |
| ``` |
| |
| Since `foo bar baz` isn't a valid class name or template ID, both active heads |
| will soon die, as will the parent GSS Node - the latter should trigger recovery. |
| |
| - we need a refcount in GSS nodes so we can recognize never-reduced node death |
| - when a node dies, we look up its recovery actions (symbol, strategy). |
| These are the union of the recovery actions for each item. |
| They can be stored in the action table. |
| Here: `actions[State, death] = Recovery(type, matching-angle-bracket)` |
| - we try each strategy: feeding in the start position = token of the dying node |
| (`foo`) getting out the end position (`>`). |
| - We form an opaque forest node with the correct symbol (`type`) spanning |
| [start, end) |
| - We create a GSS node to represent the state after recovery. |
| The new state is found in the Goto table in the usual way. |
| |
| ``` |
| "the static cast's type starts at foo" |
| ---> {expr := static_cast < . type > ( expr )} |
| | "`foo bar baz` is an unparseable type" |
| +---- {expr := static_cast < type . > (expr)} |
| ``` |
| |
| ## Which recovery heads to activate |
| |
| We probably shouldn't *always* create active recovery heads when a recoverable |
| node dies (and thus keep it alive). |
| By design GLR creates multiple speculative parse heads and lets incorrect heads |
| disappear. |
| |
| Concretely, the expression `(int *)(x)` is a valid cast, we probably shouldn't |
| also parse it as a call whose callee is a broken expr. |
| |
| The simplest solution is to only create recovery heads if there are no normal |
| heads remaining, i.e. if parsing is completely stuck. This is vulnerable if the |
| "wrong" parse makes slightly more progress than the "right" parse which has |
| better error recovery. |
| |
| A sophisticated variant might record recovery opportunities and pick the one |
| with the rightmost *endpoint* when the last parse head dies. |
| |
| We should consider whether including every recovery in the parse forest might |
| work after all - this would let disambiguation choose "broken" but likely parses |
| over "valid" but unlikely ones. |
| |
| |