PPL-Fall-2018/Documents/implementation.tex
2018-10-15 12:47:14 -06:00


\input pictex
\input /Users/Shared/TeX/defs
\advance \hoffset 0.25 true in
\def\inclassexercise{\item{$\Rightarrow$\quad\quad} {\bf In-Class Exercise:}\quad}
\def\borderline{\medskip\hrule\bigskip}
\startDoc{CS 3210 Fall 2018}{25}{Implementing a Programming Language}
{\bigboldfont A Brief Look at More Serious Language Implementation}
\bigskip
We have already played some with language design and implementation---you were asked in Project 1
to implement VPL, and I suffered greatly implementing Jive. But those languages have relatively simple
structure and can be described
using just finite automata. Now we want to learn a little about implementing a language that is described using
a {\it context free grammar}. It turns out that if a language uses a finite automaton to describe
its meaningful chunks, and a context free grammar of the right form to describe its main structure,
then we can implement the language using {\it recursive descent parsing}.
\bigskip
\Indent{.5}
Not that long ago,
probably every computer science program had a rather nasty required course known as
``the compiler course.'' As this technology became better known and accepted, this
course gradually disappeared from the curriculum, but scraps of it remain in this course.
\medskip
We will make our work much easier by mostly
ignoring issues of {\it efficiency}, both in space and time.
For example, a real compiler tries to recover from errors and go on to report other errors, but
we will be happy to detect one error and stop. And, a real compiler tries to generate code that
uses as little memory as is reasonable, and runs as quickly as is reasonable.
We won't be crazy about it in our upcoming work, but we will compromise efficiency whenever it saves us
work.
\medskip
\Outdent
{\bigboldfont Language Specification}
\bigskip
To specify a programming language, we have to specify its {\it grammar}---what sequences of symbols
(or diagrammatic elements, if it's a non-textual language) form a syntactically legal program, and its
{\it semantics}---the meaning of those sequences of symbols.
\medskip
To precisely specify the grammar of our language, we need to do some fairly abstract work (all of which overlaps the
CS 3240 course, but with a more practical focus).
\medskip
We have already seen a little of both finite automata and context free grammars, but the following will be
more detailed and organized.
\medskip
As we do the abstract stuff, we will have in mind two languages. The first will be a
simple calculator language. The second will be the language that you will be asked to
implement in Project 2. Both of these languages will be designed and specified in the
upcoming work. The implementation of the simple calculator language through
a lexical phase and
recursive descent parsing will be worked out fully as an example, leaving you to do
the same things on Project 2.
\border
{\bf Definition of a Language}
\medskip
Given a finite set, which is known as an {\it alphabet} (and can be thought of as some set
of ordinary symbols), we define a {\it string\/}
to be any finite sequence of members of the alphabet, and we define a {\it language\/} to
be any set of strings over the alphabet.
\medskip
\In
For this course, we are interested in situations where a single string is a program in some
programming language, and the language is the set of all syntactically legal programs.
\medskip
\Out
\border
{\bf Finite Automata}
\medskip
We already introduced finite automata, so let's do some examples that will
be used in the simple calculator language.
\border
{\bf Exercise 9}
\medskip
Working in your small group, draw a finite automaton for each of the following languages:
\medskip
\Indent{1}
\item{a.} all strings that start with a lowercase letter, followed by zero or more letters or digits
\medskip
\item{b.} all strings that represent a real number in the way usually
used by humans (no scientific notation allowed), say in a math class.
\border
\Outdent
{\bf The Concept of Lexical Analysis as Just a First Step in Parsing}
\medskip
It turns out that the FA technique is not expressive enough to describe the syntax we want
to have for a typical programming language.
\medskip
\In
Here's a proof of this fact: we clearly want to have, in a typical
imperative language, the ability to write expressions
such as $((((((a))))))$, where the number of extra parentheses is unlimited. Our parser needs
to be able to check whether there are as many left parentheses as right. But the FA technique
can't express this. Suppose to the contrary that we have an FA that accepts exactly the strings that
have $n$ left parentheses followed by {\tt a} followed by $n$ right parentheses. Imagine that
you are looking at this wonderful diagram. Since it is a {\it finite\/} state automaton, it has a
finite number of states, say $M$. Now, feed this machine a string that has more than $M$
left parentheses, followed by an {\tt a}, followed by the same number of right parentheses.
Since there are only $M$ states, in processing the left parentheses we must visit some state
twice, so there is a loop in the machine that processes one or more left parentheses, say $k$ of them.
Going around this loop one extra time leads to the same state as before, so the machine also accepts
the string with $k$ extra left parentheses but the original number of right parentheses,
which it shouldn't accept.
\medskip
\Out
Since the FA technique can't describe all the syntactic features we want in a programming
language, we will be forced to develop a more powerful approach. But, the FA technique
is still useful, because we will use it for the first phase of parsing, known as {\it lexical analysis}.
This is a very simple idea: we take the stream of physical symbols and use an FA to
process them, producing either an error message or a stream of {\it tokens}. A token is simply
a higher-level symbol that is composed of one or more physical symbols. Then the
remaining parsing takes place on the sequence of tokens. This is done purely as a
matter of convenience and efficiency, because the technique we will develop to do the parsing
is fully capable of doing the tokenizing part as well.
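\medskip
To make the idea of tokenizing concrete, here is a minimal Java sketch of a lexical phase driven by a
small FA. It is only an illustration and is not the {\tt Lexer} used in this course: the class name
{\tt TinyLexer}, the token spellings such as {\tt id:rate}, and the choice of token kinds (identifiers,
unsigned integers, and a few single-character symbols) are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: a hand-coded FA that groups characters into tokens.
class TinyLexer {

    // Turn the physical symbols of the input into a list of token strings
    // such as "id:rate", "num:12", and "sym:+".
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int state = 0;                       // 0 = start, 1 = in identifier, 2 = in number
        StringBuilder text = new StringBuilder();
        int i = 0;
        while (i <= input.length()) {
            // a blank serves as an end-of-input sentinel so the last token is flushed
            char ch = (i < input.length()) ? input.charAt(i) : ' ';
            if (state == 0) {
                if (Character.isLetter(ch)) { text.append(ch); state = 1; }
                else if (Character.isDigit(ch)) { text.append(ch); state = 2; }
                else if (Character.isWhitespace(ch)) { /* skip blanks between tokens */ }
                else if ("+*()=".indexOf(ch) >= 0) { tokens.add("sym:" + ch); }
                else throw new IllegalArgumentException("illegal symbol: " + ch);
                i++;
            } else if (state == 1) {
                if (Character.isLetterOrDigit(ch)) { text.append(ch); i++; }
                else {                       // identifier ended; do not consume ch
                    tokens.add("id:" + text); text.setLength(0); state = 0;
                }
            } else {
                if (Character.isDigit(ch)) { text.append(ch); i++; }
                else {                       // number ended; do not consume ch
                    tokens.add("num:" + text); text.setLength(0); state = 0;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("rate = 12*(x1 + 7)"));
    }
}
```

Note that the sketch stops at the first illegal symbol, in keeping with our plan to detect one error and stop.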
\border
\vfil\eject
{\bigboldfont Formal Grammars and Parsing}
\medskip
We will now learn a technique for specifying the syntax of a language that is more powerful
than the FA approach. One limited version of this approach will let us specify most of the
syntax of a typical programming language.
\bigskip
A {\it formal grammar\/} consists of a set of {\it non-terminal symbols}, a set of {\it terminal symbols},
and a set of {\it production rules}. One non-terminal symbol is designated as the {\it start symbol}.
\medskip
\Indent{1}
As we saw in my specification of Jive, we will often use symbols written between angle brackets,
like {\tt <expr>}, to represent
non-terminal symbols.
Terminal symbols will be tokens produced by the lexical analysis phase.
\medskip
For our early examples we will tend to use uppercase letters for non-terminal symbols and
lowercase letters for terminal symbols.
\Outdent
A production rule has one or more symbols (both terminal and non-terminal allowed) known
as the ``left hand side,'' followed
by a right arrow, followed by one or more symbols known as the ``right hand side.''
\medskip
A {\it derivation\/} starts with the start symbol and proceeds by replacing any substring of
the current string that occurs as the left hand side of a rule by the right hand side of that rule,
continuing until a string is reached that has only terminal symbols in it. At this point, that
string of terminal symbols has been proven to be in the language accepted by the grammar.
\medskip
{\bf A Simple Example}
\medskip
Here is a simple grammar, where the non-terminals are $S$, $A$, and $B$,
$S$ is the start symbol,
and the terminal symbols are $a$, $b$, $c$, and $d$.
\medskip
$ S \rightarrow AB $
$ A \rightarrow a A b $
$ A \rightarrow c $
$ B \rightarrow d $
$ B \rightarrow d B $
\medskip
Here is a sample derivation of a string from this grammar, where each line
involves one substitution:
\medskip
$S$
$AB$
$ a A b B$
$ a a A b b B$
$ a a c b b B $
$ a a c b b d B $
$ a a c b b d d B $
$ a a c b b d d d $
\medskip
This derivation shows that {\tt aacbbddd} is in the language accepted by this grammar.
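\medskip
The same derivation can be replayed mechanically. The following Java sketch is only an illustration
(the class {\tt DerivationDemo} and its method names are made up for this purpose): each step replaces
the leftmost occurrence of a rule's left hand side by its right hand side, and each intermediate
string is printed.

```java
// Hypothetical illustration: replay the sample derivation by string substitution.
class DerivationDemo {

    // Replace the leftmost occurrence of lhs in current by rhs (one derivation step).
    static String step(String current, String lhs, String rhs) {
        int pos = current.indexOf(lhs);
        if (pos < 0) {
            throw new IllegalArgumentException(lhs + " does not occur in " + current);
        }
        return current.substring(0, pos) + rhs + current.substring(pos + lhs.length());
    }

    // Apply the given rules in order, starting from the start symbol S,
    // printing the current string after every substitution.
    static String derive(String[][] rules) {
        String current = "S";
        System.out.println(current);
        for (String[] rule : rules) {
            current = step(current, rule[0], rule[1]);
            System.out.println(current);
        }
        return current;
    }

    public static void main(String[] args) {
        // the rules used, in the order they are applied in the derivation above
        String[][] rules = {
            {"S", "AB"}, {"A", "aAb"}, {"A", "aAb"}, {"A", "c"},
            {"B", "dB"}, {"B", "dB"}, {"B", "d"}
        };
        derive(rules);
    }
}
```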
\medskip
\doit Determine the language accepted by this grammar. Note that it
has parts that could have been accomplished by the FA technique, and parts that
could not.
\border
{\bf Context Free Grammars}
\medskip
This formal grammar technique
is a very general tool, but we will focus on a special version known as a
{\it context free grammar} where the left hand side of a rule can only be a single non-terminal
symbol.
\medskip
Context free grammars are nice because on each step of a derivation, the only questions are
which non-terminal symbol in the current string to replace, and which of the rules where
it occurs as the left hand side to use.
\medskip
We can now think of what we really mean by parsing. For a context free grammar, the
challenge is to take a given string---an alleged program---and try to find a derivation.
\medskip
Context free grammars are used because they are powerful enough to specify most of
what we want in the syntax of a language, but are still amenable to being parsed {\it efficiently}.
\medskip
It is also important to note that for the languages people tend to use, apparently, the
semantics (meaning) of a program are closely related to the syntactic structure.
\medskip
\Indent{1}
Here is a feature of a typical programming language that isn't captured, typically, by the
context free grammar that specifies the syntax of the language: local variables in Java
have to be defined before they are used. Typically the context free grammar
says that the body of a method consists of a sequence of one or more statements, and
statements are things like {\tt int x;} and {\tt x = 3;}, but the grammar doesn't specify that
one has to occur before the other. Those rules of the language have to be handled in
some other way.
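\medskip
One such ``other way'' is a separate pass over the statements after parsing. The Java sketch below
is a heavily simplified illustration and not how any real Java compiler works: it assumes that a method
body has already been reduced to a list of strings of the made-up forms {\tt decl x} and {\tt use x},
and it reports the first variable that is used before it is declared. The class and method names
are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration: a check that the context free grammar does not express.
class DeclaredBeforeUse {

    // Each statement is "decl NAME" or "use NAME" (a made-up simplified form).
    // Returns the first name used before being declared, or null if there is none.
    static String firstViolation(List<String> statements) {
        Set<String> declared = new HashSet<>();
        for (String stmt : statements) {
            String[] parts = stmt.split(" ");
            if (parts[0].equals("decl")) {
                declared.add(parts[1]);          // remember the declaration
            } else if (!declared.contains(parts[1])) {
                return parts[1];                 // used, but not yet declared
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstViolation(List.of("decl x", "use x")));  // prints null
        System.out.println(firstViolation(List.of("use x", "decl x")));  // prints x
    }
}
```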
\medskip
\Outdent
For context free grammars, we can improve on the notation of a derivation a lot by
the {\it parse tree\/} concept. Instead of recopying the current string on each step, we
show the replacement of a left hand side non-terminal symbol by writing a child in the
tree for each symbol in the right hand side. Here is a parse tree for the string
{\tt aacbbddd} derived in the first example:
\medskip
$$
\beginpicture
\setcoordinatesystem units <1true pt,1true pt>
\put {S} at 150.0 120.0
\put {A} at 60.0 90.0
\put {B} at 210.0 90.0
\put {a} at 0.0 60.0
\put {A} at 60.0 60.0
\put {b} at 120.0 60.0
\put {d} at 180.0 60.0
\put {B} at 270.0 60.0
\put {a} at 30.0 30.0
\put {A} at 60.0 30.0
\put {b} at 90.0 30.0
\put {d} at 240.0 30.0
\put {B} at 300.0 30.0
\put {c} at 60.0 0.0
\put {d} at 300.0 0.0
% Edge from 1 to 2:
\plot 141.0 117.0 68.0 92.0 /
\put {} at 141.0 117.0
% Edge from 1 to 3:
\plot 158.0 115.0 201.0 94.0 /
\put {} at 158.0 115.0
% Edge from 2 to 4:
\plot 51.0 85.0 8.0 64.0 /
\put {} at 51.0 85.0
% Edge from 2 to 5:
\plot 60.0 81.0 60.0 69.0 /
\put {} at 60.0 81.0
% Edge from 2 to 6:
\plot 68.0 85.0 111.0 64.0 /
\put {} at 68.0 85.0
% Edge from 3 to 7:
\plot 203.0 83.0 186.0 66.0 /
\put {} at 203.0 83.0
% Edge from 3 to 8:
\plot 218.0 85.0 261.0 64.0 /
\put {} at 218.0 85.0
% Edge from 5 to 9:
\plot 53.0 53.0 36.0 36.0 /
\put {} at 53.0 53.0
% Edge from 5 to 10:
\plot 60.0 51.0 60.0 39.0 /
\put {} at 60.0 51.0
% Edge from 5 to 11:
\plot 66.0 53.0 83.0 36.0 /
\put {} at 66.0 53.0
% Edge from 8 to 12:
\plot 263.0 53.0 246.0 36.0 /
\put {} at 263.0 53.0
% Edge from 8 to 13:
\plot 276.0 53.0 293.0 36.0 /
\put {} at 276.0 53.0
% Edge from 10 to 14:
\plot 60.0 21.0 60.0 9.0 /
\put {} at 60.0 21.0
% Edge from 13 to 15:
\plot 300.0 21.0 300.0 9.0 /
\put {} at 300.0 21.0
\endpicture
$$
\border
\vfil\eject
{\bf A Bad Way to Try to Express Expressions}
\medskip
Consider the following grammar, where $E$ is the start symbol, and
$V$ and $N$ are viewed as terminal symbols, namely tokens representing
variables and numeric literals, and other symbols that are not uppercase letters
are viewed as tokens.
\medskip
$ E \rightarrow N$
$ E \rightarrow V$
$ E \rightarrow E + E $
$ E \rightarrow E * E $
$ E \rightarrow (E)$
\medskip
This grammar seems reasonable, but it turns out to be a rather poor way to specify
a language consisting of all mathematical expressions using variables, numeric literals,
addition, multiplication, and parentheses. For one thing, the grammar is ambiguous---there
are strings that have significantly different parse trees. For another, the grammar has no
concept of operator precedence.
\medskip
\doit Come up with some expressions and produce parse trees for them.
Verify the complaints just made.
\border
{\bf A Classic Example of a Context Free Grammar}
\medskip
Now consider this grammar for the language of simple expressions,
where $E$ is the start symbol, and
$V$ and $N$ are viewed as terminal symbols, namely tokens representing
variables and numeric literals, and other symbols that are not uppercase letters
are viewed as tokens.
\medskip
$E \rightarrow T$
$E \rightarrow T+E$
$T \rightarrow F$
$ T \rightarrow F * T$
$F \rightarrow V$
$F \rightarrow N$
$F \rightarrow (E)$
\medskip
\doit This grammar captures the syntax of an {\it expression} involving two
binary operators with different precedence. Produce parse trees for some expressions and
note that the grammar is not ambiguous and enforces operator precedence.
\border
{\bf Parse Trees as Code}
\medskip
The previous example suggests a powerful idea: if someone gives you a parse tree for an
expression, and values for all the variables occurring, you can easily compute the value
of the expression---and you can easily write code that interprets the parse tree.
\medskip
Thus, in a very real sense, the tree structure is a better programming language than
the language described by the context free grammar, except for one big catch: it's hard to
read and write parse trees, compared to reading and writing strings.
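\medskip
As a rough illustration of interpreting a tree, here is a small Java sketch. The class {\tt ExprTree}
and its fields are invented for this example (they are not the {\tt Node} class from our project code),
and the tree is built by hand rather than by a parser. To keep things short the tree has one node per
operator or operand, rather than one node per grammar symbol.

```java
import java.util.Map;

// Hypothetical illustration: evaluate an expression tree from the bottom up.
class ExprTree {
    final String kind;            // "num", "var", "+", or "*"
    final String info;            // literal text or variable name (empty for operators)
    final ExprTree left, right;   // children (null for leaves)

    ExprTree(String kind, String info, ExprTree left, ExprTree right) {
        this.kind = kind; this.info = info; this.left = left; this.right = right;
    }

    static ExprTree number(String text) { return new ExprTree("num", text, null, null); }
    static ExprTree variable(String name) { return new ExprTree("var", name, null, null); }
    static ExprTree op(String kind, ExprTree l, ExprTree r) { return new ExprTree(kind, "", l, r); }

    // The value of a node is computed from the values of its children.
    double evaluate(Map<String, Double> vars) {
        switch (kind) {
            case "num": return Double.parseDouble(info);
            case "var": return vars.get(info);
            case "+":   return left.evaluate(vars) + right.evaluate(vars);
            case "*":   return left.evaluate(vars) * right.evaluate(vars);
            default:    throw new IllegalStateException("unknown node kind " + kind);
        }
    }

    public static void main(String[] args) {
        // tree for 2 + 3 * x, with the multiplication lower in the tree than the addition
        ExprTree tree = op("+", number("2"), op("*", number("3"), variable("x")));
        System.out.println(tree.evaluate(Map.of("x", 4.0)));  // prints 14.0
    }
}
```

Because the multiplication sits below the addition in the tree, operator precedence is already
settled by the shape of the tree, and the evaluation code never has to think about it.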
\border
{\bf Exercise 10}
\medskip
Working in your small group, produce a parse tree for the following expression, using the good
expression grammar.
Then, assuming that $a=12$, $b=7$, and $c=2$, show how the parse tree can be used to
compute the value of the expression---note how the value of each node in the parse tree
is obtained from the values of its children.
\medskip
Use this expression:
$$ 7*a + (c*a*c+4*a)*b $$
\border
{\bf Designing and Specifying the Simple Calculator Language}
\medskip
Now we want to design (and then implement) a language that allows the user to
perform a sequence of computations such as a person could do on a calculator.
But, we want the language to allow use of expressions and assignment statements.
\medskip
\doit As a whole group, invent a name for this language, and write some example programs
so everyone will see the sorts of things we want to allow, and the sorts of things we don't want
to allow.
\border
{\bf Exercise 11}
\medskip
Working in your small group, specify the simple calculator language by determining its
tokens (categories of conceptual symbols, such as probably {\tt <var>} for variable),
drawing an FA for each kind of token, and writing a context free grammar for the
language.
\medskip
Use {\tt <program>} for the start symbol, which should produce any possible program in the
language.
\border
\doit Working back together as a whole group, write down the final specification
for the simple calculator language.
\border
{\bf Implementing the Lexical Phase}
\medskip
\doit As a whole group, produce a Java class named {\tt Lexer} that
will take a file containing a program in the simple calculator language as input and produce a sequence of tokens.
In addition to producing this stream of tokens, {\tt Lexer} should have the ability to ``put back'' a token, for convenience
in the recursive descent parsing phase. The point is that we will produce a grammar that will allow us to
look ahead at the next token, thinking it will be part of the entity we are producing, and then realize from that next token
that we are done with the entity we were producing, so it is convenient to put back the look ahead token.
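\medskip
The ``put back'' idea can be sketched in a few lines of Java. This is only an illustration under
simplifying assumptions: the class name {\tt TokenSource} is made up, tokens are plain strings supplied
in a list instead of being read from a file, and the string {\tt eof} stands for the end of the input.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration of "putting back" a look-ahead token.
class TokenSource {
    private final Iterator<String> rest;                        // tokens not yet produced
    private final Deque<String> returned = new ArrayDeque<>();  // tokens that were put back

    TokenSource(List<String> tokens) { this.rest = tokens.iterator(); }

    // Produce the next token, preferring one that was put back; "eof" when none remain.
    String next() {
        if (!returned.isEmpty()) return returned.pop();
        return rest.hasNext() ? rest.next() : "eof";
    }

    // Return a token so that the following call to next() produces it again.
    void putBack(String token) { returned.push(token); }

    public static void main(String[] args) {
        TokenSource src = new TokenSource(List.of("x", "+", "y"));
        String first = src.next();     // "x"
        String peek = src.next();      // "+", looked at but not wanted yet
        src.putBack(peek);
        System.out.println(first + " " + src.next() + " " + src.next());  // prints x + y
    }
}
```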
\vfil\eject
%\bye
{\bf Implementing the Simple Calculator Language}
\medskip
\doit
Working as a whole group,
finish the draft design of the simple calculator language,
and write down a context free grammar to specify it.
\medskip
\Indent{1}
As we do this, we want to try to write the rules so that they process symbols on the
left, leaving the recursive parts on the right.
For example, if we want to say ``1 or more {\tt a}'s'' we could either use the
rules
$$ A \rightarrow Aa \; \vert \; a $$
or we could say
$$ A \rightarrow aA \; \vert \; a $$
with the same result from a theoretical perspective.
But, because we want to use the {\it recursive descent parsing\/} technique (there are other
parsing techniques that we will not study that like grammars in different forms),
we will want to use the second approach.
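\medskip
The reason is that a recursive descent method for a rule such as $A \rightarrow Aa$ would have to call
itself before consuming any symbol, and so would recurse forever. The following Java sketch, with
made-up class and method names and with single characters standing in for tokens, shows a method
for the second form of the rules, $A \rightarrow aA \; \vert \; a$.

```java
// Hypothetical illustration: recursive descent for  A -> aA | a.
class OnesOrMore {
    private final String input;
    private int pos = 0;          // index of the next unread symbol

    OnesOrMore(String input) { this.input = input; }

    // parseA consumes one 'a' first, then recurses only if another 'a' follows,
    // so every recursive call happens after some input has been consumed.
    boolean parseA() {
        if (pos >= input.length() || input.charAt(pos) != 'a') return false;
        pos++;                                        // consume the leading a
        if (pos < input.length() && input.charAt(pos) == 'a') {
            return parseA();                          // used the rule A -> aA
        }
        return true;                                  // used the rule A -> a
    }

    // The whole string is accepted if parseA succeeds and uses up all the input.
    static boolean accepts(String s) {
        OnesOrMore p = new OnesOrMore(s);
        return p.parseA() && p.pos == s.length();
    }

    public static void main(String[] args) {
        System.out.println(accepts("aaa"));   // prints true
        System.out.println(accepts(""));      // prints false
        System.out.println(accepts("aab"));   // prints false
    }
}
```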
\medskip
\Outdent
\border
\doit Working as a whole group, begin work on a class {\tt Parser} that will take
a {\tt Lexer} as a source of tokens and produce a {\it parse tree\/} that represents a simple calculator
language program in a way that it can be directly executed.
\medskip
\Indent{1}
As a practical matter, to save some time
we will begin with some old code that I produced for a different language and modify it massively
to do the job for our new language.
\medskip
Note that along with the {\tt Lexer} and {\tt Parser} classes, this old code includes {\tt Token} and
{\tt Node} classes that will be used extensively in our work.
\medskip
The classes {\tt Basic}, {\tt Camera}, and {\tt TreeViewer} support visualizing a parse tree for
debugging and testing purposes.
\medskip
\Outdent
As we will see, the recursive descent parsing technique involves writing a method for every non-terminal
symbol in the grammar. Each of these methods will consume a sequence of tokens and produce a
node that is the root of a tree that is the translation of the corresponding chunk of code.
\medskip
Also, these methods will return {\tt null} if they realize that the next few tokens don't fit what they are
expecting to see.
\medskip
We will not take the time to be very careful about detecting and reporting errors in the source code.
We will basically assume that the source code is correct. But, when it makes sense to do so, we will
notice errors and halt the parsing process with some error message to the programmer.
\border
\vfil\eject
{\bf Final (?) Finite Automata for Lexer for Corgi}
\bigskip
\epsfbox{../Secret/Corgi/Figures/corgiFA.1}
\border
\doit Instructor will briefly discuss {\tt Lexer} and {\tt Parser} in preparation for Exercises 12 and 13.
\border
{\bf Exercise 12}
Working in your small group, write the code for states 3, 4, and 5 in the {\tt Lexer.java} file for Corgi.
To help you understand {\tt Lexer} and do this Exercise effectively, a number of trees have been
killed to provide you with a printout of the existing code on the next pages.
\vfil\eject
\count255 1
\numberedListing{../Corgi/Lexer.java}{\tt}
\border
{\bf Exercise 13}
Working in your small group, write the code for {\tt parseFactor} in
the class {\tt Parser}, listed on the next few pages.
\bigskip
Here is the draft context free grammar for Corgi (note some changes from our earlier version):
\medskip
{\baselineskip 0.95 \baselineskip
\listing{../Corgi/corgiCFG.txt}
\border}
\count255 1
\numberedListing{../Corgi/Parser.java}{\tt}
\vfil\eject
\doit Instructor will discuss the minor changes to {\tt Parser} (listed in its entirety below)
and draw a parse tree for
a Corgi program.
\border
{\bf Exercise 14}
Working in your small groups, construct the parse tree for {\tt quadroots} following the final (I hope!)
version of {\tt Parser}.
\border
{\baselineskip 0.75\baselineskip
\count255 1
\font \smalltt cmtt10 scaled 800
\numberedListing{../Corgi/Parser.java}{\smalltt}
\border
}
{\bf Exercise 15}
Below you will find a partial listing of the {\tt Node} class, namely the partially completed
{\tt execute} and {\tt evaluate} methods.
Working in your small groups,
write the missing code.
\medskip
Note that nodes have instance variables {\tt kind}, {\tt info} (both strings), and
{\tt first} and {\tt second} (both nodes).
\medskip
Also, there is a {\tt MemTable} class that has public methods
{\tt void store( String name, double value)}
and
{\tt double retrieve( String name)}.
\medskip
If you need further details about any of the classes in the {\tt Corgi} project
(namely {\tt Corgi}, {\tt Token}, {\tt Lexer}, {\tt Parser}, {\tt Node}, and {\tt MemTable}),
you should look at them online (assuming someone in your group has a computer in class).
\vfil\eject
\count255 136
\numberedListing{../Corgi/partialNode.java}{\tt}
\vfil\eject
{\bigboldfont Designing Project 2}
\medskip
To keep Project 2 manageable (a first serious submission for Project 2 must be submitted by the time of Test 2---October 29),
I have decided that this project will simply (we hope!) add some
features to Corgi, as follows.
\medskip
\Indent{1}
First, we want to give the Corgi programmer the ability to define functions, and of course to call them.
\medskip
Second, we want to add branching to Corgi.
\medskip
\Outdent
\doit Working as a whole group, design the finite automata for the Corgi {\tt Lexer}, and the
context free grammar for the Corgi {\tt Parser}.
Discuss the features we are not adding to Corgi, and perhaps decide to add some of them.
As part of this design process, write some sample Corgi programs (these will also provide a
starting point for testing your Project 2 work).
\medskip
\doit Once the syntax of the language is specified, try to specify the {\it semantics\/} of the language,
and think about how execution/evaluation of the parse tree will be done.
\bigskip
{\bf Note:} after we have made decisions about the new and improved Corgi,
henceforth to be known as ``Corgi,'' I will document them.
But, you should be able to start working on Project 2 right away, based on the informal
documentation recorded during this class session.
\border
{\bigboldfont Project 2} [10 points] [first serious submission absolutely due by October 29]
\medskip
Implement the extended Corgi language as specified in class and in upcoming documents.
Be alert for corrections to the Corgi specification that may need to be made as work proceeds.
\medskip
Be sure to test your implementation thoroughly.
\medskip
Email me your group's submission with a single {\tt zip} or {\tt jar} file attached, named either
{\tt corgi.jar} or {\tt corgi.zip}. This single file must contain all the source code files.
Be sure to CC all group members on the email that submits your work.
\medskip
I must be able to extract all the Java source files from your submitted file into a folder, and then simply
type at the command prompt:
\medskip
{\tt javac Corgi.java}
{\tt java Corgi test3}
\medskip
in order to run the Corgi program in the file named {\tt test3}.
\border
\vfil\eject
\bye
{\bf Designing Project 2}
\medskip
\doit Discuss possibilities for Project 2 and decide what it should be.
% Then, we'll apply CYK and/or recursive descent to it
\vfil\eject
{\bigboldfont Course Policy Change}
\medskip
Test 2 is canceled, due to the facts that we haven't covered enough material suitable for fair assessment,
and we have just barely gotten to the main topics of language implementation.
\medskip
The test scheduled for April 13 will cover the material we have been and are still working on, along
with some functional programming topics, about equally weighted. The test at the time of the final exam
will cover Clojure and Prolog topics.
\medskip
Your course grade will be figured with the 68\% for Tests divided either into 3 or 4, with the next test counting
as two tests or one, and taking the maximum of the two calculations.
\border
{\bigboldfont Otter}
\medskip
I had been thinking Project 2 would be to finish implementing Otter, but now I'm thinking Project 2 will be
to implement the ``simple example'' we started on February 28th. We will spend a few more class periods
working through implementation of Otter together (with me, unfortunately, doing most of the work outside of
class).
\border
{\bigboldfont Project 2}
\medskip
Implement the simple calculator language (let's call it ``CalcLang'') designed in class, with that design
documented (including the finite automaton for the lexical part of the language, the context free grammar,
and an informal description of the language semantics) in the folder {\tt Project2} at the course web site.
\medskip
When you are done, you will have an application consisting of several Java classes. You must submit your
work to me as follows:
\medskip
\In
Make sure that the main class---the one that contains the {\tt main} method at which execution starts---is
named exactly {\tt CalcLang}.
\medskip
Clean up your source files for submission by removing all non-portable {\tt package} statements and
all extraneous output.
\medskip
Put all your source files in a folder, named {\tt Project2}.
Create a {\tt jar} file from the command line by moving into the folder that contains the folder {\tt Project2}
and entering
\smallskip
{\tt jar cvfM proj2.jar Project2}
\medskip
Test your work by making a temporary folder somewhere, copying {\tt proj2.jar} into it, moving into it,
and entering
\smallskip
{\tt jar xf proj2.jar}
\smallskip
This should make a folder {\tt Project2} that contains all your source code. Move into it and compile and
run your application.
\medskip
Email me your submission with {\tt proj2.jar} attached at {\tt shultzj@msudenver.edu}.
\bigskip
If you are somehow incapable of doing the previous work, you may as a slightly irritating (to me) alternative
make a {\tt zip} file of the folder containing your source code instead of a {\tt jar} file.
\medskip
\Out
\vfil\eject
\bye