The new pqR parser, and R’s “else” problem

|

The latest version of pqR (mostly) solves R’s “else” problem, by modifying the new parser previously introduced in pqR.  I’ll explain the “else” problem and solution here, and also present other advantages of pqR’s parser over the R Core parser, including a big speed advantage in one context.

The “else” problem in R

Probably most new R programmers have encountered the following problem. They put some statements like the following in a file, say “sltp.r”:

They then try to execute this script with “source”, and get this result:

Trying to run the script with Rscript has the same problem:

This is puzzling, since the same code works fine inside a function whose body is in curly brackets:

Now, there’s a reason for this. When entering expressions interactively, R is supposed to print the value of the expression as soonas it’s been entered. But how can it know whether an “if” statement at the end of a line with no else clause is complete (there is no “else” part), or whether instead the user intends to enter an “else” clause on the next line?

One can imagine kludges to get around this in many cases, but the following example shows that there’s not going to be a generalsolution:

After entering the first line above, the user expects to see the output below. They don’t expect to be asked to enter more text in case that text might be an “else” clause.

Situations like the one above are fairly rare, however. One could decide that the user has to enter a blank line after such an “if” statement to confirm that there is no “else” clause — that’s what Python does. But let’s suppose we don’t want to do that. Then we have to evaluate the expression containing the “if” immediately, and consider a following “else” to be an error.

But for non-interactive input, there’s no need to disallow a top-level “else” at the start of a line. Yet it has always been an error in R Core implementations. It no longer is an error in pqR — in input from a file read by Rscript, by “source”, or by “parse”, as well when “parse” is applied to a vector of character strings, it is now OK for an “else” be appear at the start of a line. This avoids annoyances when writing scripts, and also problems when pasting code into a script that was copied from inside a function with curly brackets (where “else” at the start of a line has always been legal).

I don’t know whether this problem has persisted so long in R Core implementations just due to inertia, or because it’s hard to modify the R Core parser to do this. It was fairly easy to modify the pqR parser, though there were some picky details involved.

New operators

The new pqR parser has also facilitated the introduction of new operators in pqR. The .. operator for reliable sequence generation was introduced previously. The latest release introduces ! and !! as operator forms of “paste0” and “paste”. It might be hard to modify the R Core parser to include these operators, because their correct parsing requires use of context — “..” is allowed as part of an identifier (but only at the start or end in pqR), and  “!!” can be two successive unary-not operators, but not in a context where the new !! operator can legally appear.

Top-down recursive descent parsing

This leads into the biggest difference between the new pqR parser and the old R Core parser.

The R Core parser is a bottom-up parser automatically generated from a grammar using the Bison/Yacc parser generator. Automatically generated? Sounds wonderful! But one of the inside secrets of computer science is that, despite decades of development, automatic parser generators are not actually very useful. As an illustration of this, gcc previously used such a parser, but abandoned it.

The generally-preferred method is to manually write a top-down “recursive-descent” parser. In theory, this shouldn’t be. Top-down parsers with k-symbol lookahead can handle grammars in the class LL(k); bottom-up parsers with k-symbol lookahead can handle grammars in the class LR(k), which is larger than LL(k). But in practice, most programming languages are in the LL(k) class if one assumes low-level tokenization has been handled, and dealing with funny contextual issues like the pqR .. and !! operators is easier in a top-down parser. Furthermore, the advantage of just writing a grammar and having code for the parser generated automatically is largely illusory, since the code for recursive-descent parsers reads almost like a grammar, with one recursive function for each non-terminal symbol. The code gets more cluttered when one puts in the semantics, but this aspect is also easier in a recursive-descent parser than when using a parser generator.

Parsing speed

The new pqR parser is faster than the R Core parser, but typically only moderately so. One exception, however, is when parsing is done for Rscript, and an expression (e.g., a function definition) extends over many lines — R Core implementations take time growing as the square of the number of lines, whereas pqR takes linear time (as one would expect).

This gross inefficiency is due not just to the R Core parser itself but also to how it interfaces to R’s “read-eval-print loop” (the “REPL”). The R Core implementation used in Rscript first tries parsing the first line of input, and checks whether the parser says it has a complete expression. If not, it tries parsing the first two lines of input. If that doesn’t give a complete expression, it tries parsing the first three lines of input. And so forth.

Here’s an illustration of the resulting inefficiency. File x2.r contains the following:

Files x300.r and x600.r contain the same thing except that there are 300 and 600 repetitions of the line in the function definition rather than two repetitions.

The time for running these scripts with R-3.5.1’s version of Rscript can be seen here:

And for comparison, here are the times with pqR-2018-11-18’s version of Rscript:

One can see that pqR is faster even for x2.r, but more importantly, with pqR the time for x300.r and x600.r is negligibly different, as one would expect since even 2400 additions should take negligible time. The huge increase in time with R-3.5.1 is due to the quadratic growth in the time to parse the function definition.

Related