All my life, my heart has yearned for a thing I cannot name.
André Breton, Mad Love
The interpreter we have so far feels less like programming a real language and
more like punching buttons on a calculator. “Programming” to me means building
up a system out of smaller pieces. We can’t do that yet because we have no way
to bind a name to some data or function. We can’t compose software without a way
to refer to the pieces.
To support bindings, our interpreter needs internal state. When you define a
variable at the beginning of the program and use it at the end, the interpreter
has to hold on to the value of that variable in the meantime. So in this
chapter, we will give our interpreter a brain that can not just process, but
remember.
State and statements go hand in hand. Since statements,
by definition, don’t evaluate to a value, they need to do something else to be
useful. That something is called a side effect. It could mean producing
user-visible output or modifying some state in the interpreter that can be
detected later. The latter makes them a great fit for defining variables or
other named entities.
In this chapter, we’ll do all of that. We’ll define statements that produce
output (print) and create state (var). We’ll add expressions to access and
assign to variables. Finally, we’ll add blocks and local scope. That’s a lot to
stuff into one chapter, but we’ll chew through it all one bite at a time.
We start by extending Lox’s grammar with statements. They aren’t very different
from expressions. We start with the two simplest kinds:
An expression statement lets you place an expression where a statement
is expected. They exist to evaluate expressions that have side effects. You
may not notice them, but you use them all the time in C, Python, Python, and other languages. Any time you
see a function or method call followed by a ;, you’re looking at an
expression statement.
A print statement evaluates an expression and displays the result to
the user. I admit it’s weird to bake printing right into the language
instead of making it a library function. Doing so is a concession to the
fact that we’re building this interpreter one chapter at a time and want to
be able to play with it before it’s all done. To make print a library
function, we’d have to wait until we had all of the machinery for defining
and calling functions before we could witness any
side effects.
New syntax means new grammar rules. In this chapter, we finally gain the ability
to parse an entire Lox script. Since Lox is an imperative, dynamically typed
language, the “top level” of a script is simply a list of statements. The new
rules are:
The first rule is now program, which is the starting point for the grammar and
represents a complete Lox script or REPL entry. A program is a list of
statements followed by the special “end of file” token. The mandatory end token
ensures the parser consumes the entire input and doesn’t silently ignore
erroneous unconsumed tokens at the end of a script.
Right now, statement only has two cases for the two kinds of statements we’ve
described. We’ll fill in more later in this chapter and in the following ones.
The next step is turning this grammar into something we can store in memory—syntax trees.
There is no place in the grammar where both an expression and a statement are
allowed. The operands of, say, + are always expressions, never statements. The
body of a while loop is always a statement.
Since the two syntaxes are disjoint, we don’t need a single base class that they
all inherit from. Splitting expressions and statements into separate class
hierarchies help us find dumb mistakes like passing a statement to a method that
expects an expression (granted you configured your IDE to catch such things).
That means a new base class for statements. As our elders did before us, we will
use the cryptic name “Stmt”. Just as before, we will define them using
dataclasses.
# lox/ast.py after ExprclassStmt:
"""Abstract Base Class for statements"""@dataclassclassExpression(Stmt):
expression: Expr@dataclassclassPrint(Stmt):
expression: Expr
We will also create a special Stmt node that represents whole programs:
# lox/ast.py after Stmt@dataclassclassProgram(Stmt):
statements: list[Stmt]
The parser’s parse() function that parses and returns a single expression was
a temporary hack to get the last chapter up and running. Now that our grammar
has the correct starting rule, program, we can turn parse() into the real
deal.
This parses a series of statements, as many as it can find until it hits the end
of the input. This is a pretty direct translation of the program rule into
recursive descent style.
A program is a list of statements, and we parse one of those statements using
this method:
A little bare bones, but we’ll fill it in with more statement types later. We
determine which specific statement rule is matched by looking at the current
token. A print token means it’s obviously a print statement.
If the next token doesn’t look like any known kind of statement, we assume it
must be an expression statement. That’s the typical final fallthrough case when
parsing a statement, since it’s hard to proactively recognize an expression from
its first token.
Each statement kind gets its own method. First print:
# lox/parser.py Parser methoddefprint_statement(self) -> Print:
self.consume("PRINT", "Expect 'print' keyword.")
value = self.expression()
self.consume("SEMICOLON", "Expect ';' after value.")
returnPrint(value)
Since we usually call print_statement() from a within statement(), we could
simply consume the first token knowning that it would be a print. Parsing the
whole statement without such assumptions, makes things easier to isolate for
testing and reuse, albeit with at a very small cost in performance. We parse the
subsequent expression, consume the terminating semicolon, and emit the syntax
tree.
If we didn’t match a print statement, we must have one of these:
We’re running through the previous couple of chapters in microcosm, working our
way through the front end. Our parser can now produce statement syntax trees, so
the next and final step is to interpret them. As in expressions, we use
singledistpatch, but now we have a slightly different interface: statements do
not produce returning values, so we need to adapt our eval-like function to
account for this.
Rather than patching eval to handle Stmt arguments, we will create an
entirely different funtion called exec. This is consistent with Python’s own
nomeclature in which the builtin eval handles Python code that represents
expressions and exec handles Python code that represent whole programs and
statements.
# lox/interpreter.py@singledispatchdefexec(stmt: Stmt, env: Env) -> None:
msg = f"exec not implemented for {type(stmt)}"raiseTypeError(msg)
We have two statement types, and we need an implementation method for each. The
easiest is expression statements.
We evaluate the inner expression using our existing eval() method and
discard the value.
The print statement’s visit method isn’t much different.
# lox/interpreter.py after exec()@exec.registerdef_(stmt: Print, env: Env) -> None:
value = eval(stmt.expression, env)
print(stringify(value))
Before discarding the expression’s value, we convert it to a string using the
stringify() function we introduced in the last chapter and then dump it to
stdout.
We must provide the implementation of exec(Program) to allow our interpreter
to handle whole programs.
Basically just plumbing the new syntax through. OK, fire up the interpreter and
give it a try. At this point, it’s worth sketching out a little Lox program in a
text file to run as a script. Something like:
print"one";
printtrue;
print2 + 1;
It almost looks like a real program! Note that the REPL, too, now requires you
to enter a full statement instead of a simple expression. Don’t forget your
semicolons.
Now that we have statements, we can start working on state. Before we get into
all of the complexity of lexical scoping, we’ll start off with the easiest kind
of variables—globals. We need two new constructs.
A variable declaration statement brings a new variable into the world.
varbeverage = "espresso";
This creates a new binding that associates a name (here “beverage”) with a
value (here, the string "espresso").
Once that’s done, a variable expression accesses that binding. When the
identifier “beverage” is used as an expression, it looks up the value bound
to that name and returns it.
printbeverage; // "espresso".
Later, we’ll add assignment and block scope, but that’s enough to get moving.
As before, we’ll work through the implementation from front to back, starting
with the syntax. Variable declarations are statements, but they are different
from other statements, and we’re going to split the statement grammar in two to
handle them. That’s because the grammar restricts where some kinds of statements
are allowed.
The clauses in control flow statements—think the then and else branches of an
if statement or the body of a while—are each a single statement. But that
statement is not allowed to be one that declares a name. This is OK:
if (monday) print"Ugh, already?";
But this is not:
if (monday) varbeverage = "espresso";
We could allow the latter, but it’s confusing. What is the scope of that
beverage variable? Does it persist after the if statement? If so, what is
its value on days other than Monday? Does the variable exist at all on those
days?
Code like this is weird, so C, Python, and friends all disallow it. It’s as if
there are two levels of “precedence” for statements.
Some places where a statement is allowed—like inside a block or at the top
level—allow any kind of statement, including declarations. Others allow only
the “higher” precedence statements that don’t declare names.
To accommodate the distinction, we add another rule for kinds of statements that
declare names.
Declaration statements go under the new declaration rule. Right now, it’s only
variables, but later it will include functions and classes. Any place where a
declaration is allowed also allows non-declaring statements, so the
declaration rule falls through to statement. Obviously, you can declare
stuff at the top level of a script, so program routes to the new rule.
Like most statements, it starts with a leading keyword. In this case, var.
Then an identifier token for the name of the variable being declared, followed
by an optional initializer expression. Finally, we put a bow on it with the
semicolon.
To access a variable, we define a new kind of primary expression.
Before we parse variable statements, we need to shift around some code to make
room for the new declaration rule in the grammar. The top level of a program
is now a list of declarations, so the entrypoint method to the parser changes.
Hey, do you remember way back in that earlier chapter when we put the
infrastructure in place to do error recovery? We are finally ready to hook that
up. We replace the entry point to the parser with this:
# lox/parser.py the while loop in the parse()
...
whilenotparser.is_at_end():
try:
statements.append(parser.declaration())
exceptLoxSyntaxError:
parser.synchronize()
...
This declaration() method is the method we call repeatedly when parsing a
series of statements in a block or a script, so it’s the right place to
synchronize when the parser goes into panic mode. We capture any LoxSyntaxError
exceptions and initialize the error recovery. The try/except is used to keep
parsing even when we hit an error, and if an error occurs, this gets it back to
trying to parse the beginning of the next statement or declaration.
After the parsing is done, we need to announce all errors, if any, to the user.
We do this by checking the self.errors list and raising an LoxStaticError
when appropriate. This requires an import:
# lox/parser.py at the topfromlox.errorsimportLoxStaticError
And a small verification just before returning:
# lox/parser.py parse() before the return
...
ifparser.errors:
raiseLoxStaticError(parser.errors)
...
The real parsing happens inside the declaration() method. First, it looks to
see if we’re at a variable declaration by looking for the leading var keyword.
If not, it falls through to the existing statement() method that parses
print and expression statements.
Remember how statement() tries to parse an expression statement if no other
statement matches? And expression() reports a syntax error if it can’t parse
an expression at the current token? That chain of calls ensures we report an
error if a valid declaration or statement isn’t parsed.
When the parser matches a var token, it branches to:
As always, the recursive descent code follows the grammar rule. It first
consumes the var keyword and then the identifier token for the variable name.
Then, if it sees an = token, it knows there is an initializer expression and
parses it. Otherwise, it creates an initializer that represents null. Finally,
it consumes the required semicolon at the end of the statement. All this gets
wrapped in a Var syntax tree node and we’re groovy.
Parsing a variable expression is even easier. In primary(), we look for an
identifier token.
# lox/parser.py Parser.primary if chainifself.match("IDENTIFIER"):
returnVariable(self.previous())
That gives us a working front end for declaring and using variables. All that’s
left is to feed it into the interpreter. Before we get to that, we need to talk
about where variables live in memory.
The bindings that associate variables to values need to be stored somewhere.
Ever since the Lisp folks invented parentheses, this data structure has been
called an environment.
You can think of it like a map where the keys are
variable names and the values are the variable’s, uh, values. In fact, that’s
how we’ll implement it in Python. We could stuff that map and the code to manage
it right into Interpreter, but since it forms a nicely delineated concept, we’ll
pull it out into its own class.